ABSTRACT


Data exfiltration over a network poses a threat to confidential information. Due to the
possibility of malicious insiders, this threat is especially difficult to mitigate. Our goal
is to contribute to the development of a method to detect exfiltration of many targeted
files without incurring the full cost of reassembling flows. One strategy for accomplishing 
this would be to implement an approximate matching scheme that attempts to determine 
whether a file is being transmitted over the network by analyzing the quantity of payload 
data that matches fragments of the targeted file. Our work establishes the basic feasibility 
of such an approach by matching Transmission Control Protocol (TCP) payloads of traffic 
containing exfiltrated data against a database of MD5 hashes, each representing a fragment 
of our target data. We tested against a database of 415 million fragment hashes, where 
the length of the fragments was chosen to be smaller than the payload size expected for 
most common Maximum Transmission Units (MTUs), and we simulated exfiltration by 
sending a sample of our targeted data across the network along with other non-target files 
representing “noise.” We demonstrate that under these conditions, we are able to detect the 
targeted content with a recall of 98.3% and precision of 99.1%.

CHAPTER 1:

Introduction 


Data exfiltration is the “unauthorized transfer of sensitive information from a target’s 
network to a location which a threat actor controls” [1]. Recent data leaks, such as Edward 
Snowden’s release of classified data [2], the Office of Personnel Management’s loss of 
sensitive records in 2015 [3], or the Sony hack of 2014 [4], have brought increased attention 
to the potential damage that can be caused by malicious data exfiltration. Each of these 
incidents highlights the potential consequences of compromise and data loss to government, 
private organizations and private citizens. Arguably, the Office of Personnel Management 
incident was the most damaging, with the disclosure of personal information for over 22 
million government officials, contractors, and their friends and families. In the Sony hack, 
the primary target was a private company rather than a government organization, and the 
primary consequence was financial damage caused by the loss of over 100 terabytes of 
data, including proprietary information. However, collateral damage included the loss 
of personnel details such as e-mails, salary information for executives, usernames and 
passwords. These incidents strongly argue for the importance of protecting the sensitive 
documents of the government and proprietary information or content of private companies. 

Data exfiltration can take many forms depending on the quantity of data targeted by the 
attacker and the system on which the data is stored. Small artifacts, such as passwords 
or encryption keys, might be removed using low-bandwidth out-of-band methods. For
example, cell phones send and receive Global System for Mobile communication (GSM)
frequencies to communicate with cell towers and other cell phones. However, there is
a possibility that transmissions using these GSM frequencies can be made to perform
unauthorized tasks, including the transmission of personal or contact data that exists on
the compromised cell phone. Further, Guri et al. were able to use these GSM frequencies
to obtain information from a desktop computer by manipulating memory to produce GSM 
transmissions. A compromised Android cell phone placed within a close range to the 
desktop was able to capture and demodulate these signals [5]. 

A sneakernet attack is another data exfiltration method in which data is exfiltrated using a
hard drive, flash drive, CD, or other media [6]. An insider might have an opportunity to
copy this data and information from the system to an external media device, such as a flash
drive or hard drive.

In order to address problems caused by current vulnerabilities, some companies and agencies
may train their employees to better protect their credentials and show them what different
attacks look like to better protect the network. Others might implement more
physical or electronic security, whether it be more cameras or guards at work, or more
firewalls and data exfiltration detection systems. However, this still does not completely
prevent data from leaving the network.

Current network data exfiltration systems rebuild packet streams to observe and read the
payload. By building stronger network exfiltration detection systems, we hope to create a
system to detect data loss.


1.1 Motivation 

A large number of packets flow through networks constantly. Some systems
analyze individual packets using full packet inspection, where each packet’s payload is
analyzed to determine if it contains anything malicious. There is also
the technique of flow reassembly, which puts the payloads of all packets in a flow back
together in the correct order to rebuild the original document. We hope to improve the
ability of a network operator to stop data exfiltration attacks. Typical systems for detecting
data exfiltration require full packet inspection, which usually involves reassembling packet
streams. Packet reassembly takes time, which organizations often do not have enough
of. Resources can be overloaded and might not be able to handle the stress of constantly
reassembling packet data. Dharmapurikar and Paxson built a robust TCP stream reassembly
machine. They were able to store up to 16 million connection records with 512 MB of
SDRAM [7]. However, if an adversary could overwhelm the system, that could render the
flow reassembly system useless.

We hope to contribute to the eventual development of a method that uses approximate
matching to find target data without reconstructing streams of network traffic. Signature-based
methods can sometimes struggle with finding files that have been slightly altered.
This is because hashes or signatures are built with the idea that they will see the file in
the same state. However, if an adversary modifies the file, even by one byte, that will
completely change the hash of the file and may cause a signature-based system to miss the
file. Approximate matching has the ability to take pieces from different parts of the file and
see if they might match hashes in a database. This method might miss the altered part of a
file, but would not miss the other pieces.

Our goal in this thesis is to establish a foundation for developing such a technique by first 
demonstrating that we can correctly match payload fragments to file fragments. Theoretically,
such a system could detect fragments more quickly and with fewer resources than
would be required for stream reassembly, though we leave comparison of these performance 
metrics to future work. In addition, we focus exclusively on in-band data that is transmitted 
over the network without obfuscation; out-of-band data is not within the scope of this thesis 
and we do not address encryption or obfuscation of the payloads of the packets captured. 

Our method looks at packets individually, and makes a determination when looking at each 
packet. Each packet’s payload is hashed to determine if the hash matches something in 
the database. If the hash matches, that might give an indication that the file exists in our 
blacklist. If there are multiple hashes of packet payloads that match to hashes of the same 
file fragments, then it is highly likely that the file exists in the network traffic. Ultimately an 
approximate matching system will incorporate a strategy for translating these raw matches 
into a decision regarding whether the file is or is not present in the traffic; we leave the 
development of this strategy for future work. 
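
The per-packet matching described above can be sketched as follows. This is only an illustration under stated assumptions, not the system built in this thesis: the fragment length `FRAGMENT_SIZE`, the in-memory dictionary standing in for the hash database, the file name, and the function names are all hypothetical, and we assume that every fragment-sized window of a payload is hashed.

```python
import hashlib
from collections import Counter

FRAGMENT_SIZE = 8  # hypothetical; kept tiny so the example is easy to follow


def build_fragment_db(files):
    """Map the MD5 of every aligned fragment to the names of files containing it."""
    db = {}
    for name, data in files.items():
        for off in range(0, len(data) - FRAGMENT_SIZE + 1, FRAGMENT_SIZE):
            digest = hashlib.md5(data[off:off + FRAGMENT_SIZE]).digest()
            db.setdefault(digest, set()).add(name)
    return db


def match_payload(payload, db):
    """Hash each fragment-sized window of one payload and count hits per file."""
    hits = Counter()
    for off in range(len(payload) - FRAGMENT_SIZE + 1):
        digest = hashlib.md5(payload[off:off + FRAGMENT_SIZE]).digest()
        for name in db.get(digest, ()):
            hits[name] += 1
    return hits


target = {"secret.doc": bytes(range(64))}       # hypothetical target file
db = build_fragment_db(target)

# A payload with a 3-byte header followed by 24 bytes of the target file.
payload = b"HDR" + target["secret.doc"][16:40]
hits = match_payload(payload, db)               # raw matches only, no decision
noise_hits = match_payload(b"\xff" * 40, db)    # unrelated "noise" payload
```

The sketch stops at raw match counts per file; turning those counts into a decision about whether the file is present is the strategy left for future work.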


1.2 Contributions 

We made the following contributions: 

• We built a system that uses open source tools to search for target content in traffic 
without flow reconstruction. 

• We demonstrated our method using a database that contained 991 files and 415 million 
MD5 hashes in our black list, representing targeted content. 

• We demonstrate that we are able to detect the targeted content with a precision of 
98.3% and recall of 99.2%.

1.3 Structure of this Thesis

Chapter 2 is Technical Background, describing important concepts needed to better understand
the thesis. Chapter 3 is Related Work, which looks at work that has been done in the
past in this research area. Chapter 4 is Methodology, Chapter 5 is the cap of the thesis, and
Chapter 6 is the Conclusion and Future Work.

CHAPTER 2:
Technical Background


This chapter introduces some key terms and concepts that are important to this thesis. First
we describe definitions relevant to network exfiltration. We then describe concepts and
terminology associated with target data leaving the network. Finally, we discuss the tools
that were used to generate data for this thesis.


2.1 Networking Terms 

To facilitate our discussion on data exfiltration, it is useful to begin with a review of some
basic networking terminology.

• Flow. An IP flow is defined as a “set of packets, observed in the network within some 
time period, that share a common key” [8]. The flow key consists of the source IP, 
source port, destination IP, destination port and protocol [8]. Flow level statistics 
give insight into the number and type of conversations occurring on a network, but 
not the content of the communications. For this reason the amount of data required 
to describe flows is fairly small and storage-efficient compared to storing full-packet 
captures. 

• Flow Reassembly. Flow reassembly, also called stream reassembly, is the process of 
completely restoring an entire TCP, or other data stream, to the correct order, often 
to inspect application-layer data. For example, a file might be reassembled at the end 
of a download [7]. 

• Maximum Transmission Unit: The maximum transmission unit, or MTU, for a 
packet is the “maximum sized datagram that can be transmitted through the next 
network” [9]. If there is a difference in MTU between the source and destination, the 
MTU will be reduced to the smaller of the two and packets will then be fragmented 
or dropped [10]. The common MTU size for an Ethernet packet is 1,500 bytes [11]. 
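
As a small worked example of the terms above, the sketch below groups packets by the five-field flow key and computes the largest TCP payload that fits in a given MTU. The header sizes assume IPv4 and TCP without options, and the addresses, ports, and function names are made up for illustration.

```python
from collections import defaultdict, namedtuple

# The five fields of the flow key described above.
FlowKey = namedtuple("FlowKey", "src_ip src_port dst_ip dst_port protocol")

IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def max_tcp_payload(mtu):
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


def flow_stats(packets):
    """Summarize packets per flow as [packet count, total payload bytes]."""
    stats = defaultdict(lambda: [0, 0])
    for key, payload_len in packets:
        stats[key][0] += 1
        stats[key][1] += payload_len
    return dict(stats)


web = FlowKey("10.0.0.5", 49152, "192.0.2.10", 80, "TCP")
dns = FlowKey("10.0.0.5", 53000, "192.0.2.53", 53, "UDP")
packets = [(web, 1460), (web, 1460), (dns, 40), (web, 512)]
stats = flow_stats(packets)
```

With the common Ethernet MTU of 1,500 bytes, `max_tcp_payload(1500)` gives 1,460 bytes under these assumptions. The flow summary stores only counts and sizes, which is why flow records are so much smaller than full-packet captures.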

2.2 Network Exfiltration Concepts 

We briefly describe some basic concepts that are related to data exfiltration.

• Attacker: Attackers are typically divided into two main categories: threatening
insiders and unauthorized intruders. In their risk management approach to insider
threat, Bishop et al. observe that threatening insiders can have access to machines on
a corporate network; they can also have tokens, or other credentials that will enable
them to access different areas of a building; finally, they can have intimate knowledge
of certain classified or confidential projects [12]. Moreover, Bishop et al. state that
there are three main characteristics that an insider might have: access, knowledge,
and trust. Access is usually a factor because an employee has to be able to access their
workspace. Knowledge might be a factor, especially if there is classified data involved.
Trust is another key factor implied by privileged access, such as authorization to
access a secure perimeter [12]. For example, system administrators have power with
respect to the network that they administer. Attackers who lack the special privileges
of insiders are typically considered external, unauthorized intruders. Unauthorized
intruders may use a variety of attacks to gain access to protected resources, ranging
from social engineering to exploitation of vulnerable services. Once access has
been achieved to an exploited network, intruders often attempt to exfiltrate sensitive
data [13].

• Intrusion Detection: An intrusion detection system is defined by Dorothy Denning
as a system that “aims to detect a wide range of security violations ranging
from attempted break-ins by outsiders to system penetrations and abuses by insiders” [14].
Intrusion detection focuses on trying to identify potential threats to a
network. Whether the intrusions are found by scanning for known signatures (see
Section 2.3.2), or by identifying statistical anomalies, the most common procedure
is to generate an alert that is sent to an administrator for review. Depending on
the system protocol, an administrator could review the alert and take an action, or
the system could initiate an automated response to the alert that is generated by the
intrusion detection system.

• Exfiltration: Trend Micro defines data exfiltration as the “unauthorized transfer of
sensitive information from a target’s network to a location which a threat actor controls” [1].
Giani et al. argue that techniques for detecting data exfiltration require
the “ability to make a distinction between legitimate and malicious information communication” [15].
A data exfiltration detection system determines whether or not
sensitive information has left an information system. Often this is accomplished by
consulting a black list of files that should not be allowed to leave the network. A data
exfiltration prevention system operates in a similar manner but will also attempt to
prevent unauthorized information from leaving the system.

• Firewall: A firewall is a device or program designed to establish a perimeter around 
a network and create a single point of entry where security policies can be enforced 
and auditing can be performed [16]. Security policies for firewalls are enforced by 
the rulesets that are used. Each firewall contains a ruleset that determines what will 
happen when each packet is seen. These rulesets can be based on protocol, time to
live, IP address, port number, and other features.

There are two different types of firewalls: stateful and stateless [17]. Stateless
firewalls are meant to filter out policy violations that can be easily detected without
reassembling packets, such as unauthorized connections to services like FTP, SSH,
RDP, and MSSQL. Stateful firewalls are designed to complete
more advanced tasks than stateless firewalls [18]. They are capable of observing an
entire traffic stream. For example, if the stateful firewall sees a SYN-ACK packet
when it never saw a SYN packet, it can be configured to drop that packet. The stateless
firewall would let the SYN-ACK packet pass because it does not maintain a record of
previous packets.

• Deep Packet Inspection: Firewalls capable of deep packet inspection are able to examine
the application layer payloads of packets entering and leaving the network [19].
Such systems may use signatures, statistics or heuristics against payload content to
attempt to enforce network policy. They may also be capable of flow reconstruction,
i.e., the process of rebuilding the payloads across multiple packets in a flow.
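
The difference between stateless and stateful filtering in the SYN-ACK example above can be sketched in a few lines. This is a toy model under our own assumptions (packets as dictionaries, a hypothetical `StatefulFilter` class, made-up addresses), not the behavior of any particular firewall product.

```python
def stateless_allows(packet, blocked_ports):
    """Stateless check: decide using only the fields of this one packet."""
    return packet["dst_port"] not in blocked_ports


class StatefulFilter:
    """Remember which connections sent a SYN and drop unsolicited SYN-ACKs."""

    def __init__(self):
        self.syn_seen = set()

    def allows(self, packet):
        conn = (packet["src"], packet["src_port"], packet["dst"], packet["dst_port"])
        if packet["flags"] == "SYN":
            self.syn_seen.add(conn)
            return True
        if packet["flags"] == "SYN-ACK":
            # A SYN-ACK answers a SYN that travelled in the opposite direction.
            reverse = (packet["dst"], packet["dst_port"], packet["src"], packet["src_port"])
            return reverse in self.syn_seen
        return True


syn = {"src": "10.0.0.5", "src_port": 40000,
       "dst": "192.0.2.10", "dst_port": 80, "flags": "SYN"}
syn_ack = {"src": "192.0.2.10", "src_port": 80,
           "dst": "10.0.0.5", "dst_port": 40000, "flags": "SYN-ACK"}
stray = {"src": "198.51.100.7", "src_port": 80,
         "dst": "10.0.0.5", "dst_port": 40001, "flags": "SYN-ACK"}

fw = StatefulFilter()
results = [fw.allows(syn), fw.allows(syn_ack), fw.allows(stray)]
stray_passes_stateless = stateless_allows(stray, blocked_ports={21, 22})
```

The stray SYN-ACK is dropped by the stateful filter because no matching SYN was recorded, while the stateless check passes it because its port is not blocked.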


2.3 Target Data Detection Concepts 

The need to distinguish targeted content from surrounding data and locate it quickly is not
confined to the analysis of network traffic. This task also arises frequently in digital forensics
analysis performed on secondary storage devices. Some of the terms and techniques
described in this section are adopted from that problem domain.

2.3.1 Target Data

We define target data as the data we attempt to detect in our traffic. In our experiment, this 
data comes from the Govdocs data corpus and is intended to represent intellectual property 
or sensitive documents that are not authorized for transmission over the network [20]. 

2.3.2 Signature 

A signature is a string derived from some digital object, often for the purpose of creating a 
compressed representation of that object. Signatures are designed such that if two signatures 
are identical, there is a high probability that their objects are also identical. Ideally, no other 
pair of documents should have the same digital signature. 

2.3.3 Signature-based Detection 

Signature-based detection is used in intrusion detection systems, like Snort and Suricata [21], [22].
Packets and flows are collected from the network and inspected to determine
if any data matches a signature in a ruleset. If the packet’s content matches a signature in
the ruleset, an alert will be generated. Rule-based detection focuses on matching the content of
a packet against a certain signature, whether based on IP, port number, or certain bytes in the
payload’s content. All of these items and more make up a signature that Snort and Suricata
would look for.

In our work, we implement a kind of signature-based detection based on checking hashes of
packets’ payloads to see if they match against our database. In contrast to traditional rule-based
methods, approximate matching focuses on finding similarities between the content
of the packet payload and the target data. When multiple pieces of a file appear
in payloads, the hashes of those pieces can be matched against the hashes of file fragments stored
in a database.

2.3.4 Cryptographic Hash 

A cryptographic hash is a one-way function that takes arbitrary length input and generates 
a fixed-length output. Because the length of the output is constant, and the hashing function 
can be applied to an entire file, cryptographic hashes are a compact way of representing 
content. One application of a cryptographic hash is to determine a file’s authenticity. For
example, hashes are provided online by many software vendors for their customers. It is
possible to compare the hashes on the vendor’s website with those hashes of the files that 
you download. Another helpful application of cryptographic hashes is digital signatures. 
As we will demonstrate in Section 4.5, hashes allow us to build compact signatures, which 
enables us to store many different hashes in a database. 
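
The download check described above might look like the following sketch. The function names are ours, and the choice of SHA256 as the default algorithm is an assumption; a vendor may publish a different hash type.

```python
import hashlib
import hmac


def file_digest_hex(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so large downloads need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published_hash(path, published_hex, algorithm="sha256"):
    """Compare the computed digest with the value published by the vendor."""
    computed = file_digest_hex(path, algorithm)
    return hmac.compare_digest(computed, published_hex.strip().lower())
```

A caller would pass the path of the downloaded file and the hexadecimal string copied from the vendor’s website; any difference in the file’s bytes makes the comparison fail.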

The most common types of hashes are MD5, SHA1, and SHA256. MD5 hashes are 16 bytes
long (32 hexadecimal characters), SHA1 hashes are 20 bytes long (40 hexadecimal characters),
and SHA256 hashes are 32 bytes long (64 hexadecimal characters). It will take more time for
an SHA256 hash to be generated than an MD5 hash because of the greater
complexity of the algorithm. A comparison of the speed of these algorithms run on a MacBook Pro,
with a 2.3 GHz Intel Core i7 processor and 16 GB of RAM, is shown in Table 2.1. The
numbers shown are in megabytes per second.


Block Size    16 Bytes    64 Bytes    256 Bytes    1024 Bytes    8192 Bytes

MD5           42.35       121.72      265.83       379.52        430.10
SHA1          44.81       131.57      285.54       412.20        456.88
SHA256        33.63       80.67       146.01       178.94        191.52


Table 2.1: Speed of three hash functions in megabytes per second when run
repeatedly on different block sizes. Note that SHA256 is considerably slower
than MD5.


The main hashing function that was used in this thesis was MD5. Even though MD5 is not 
secure, our work does not depend on the security features that an MD5 hash would provide. 
We are using it to provide us with a simple signature that we can store in a database. 
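
The digest lengths of these hash functions, and a rough throughput measurement, can be reproduced with Python’s standard `hashlib` module. This is a sketch of one way to take such a measurement, not the procedure used to produce Table 2.1; the function names and the 4 MB workload are our assumptions, and interpreter overhead means the figures will not match the table.

```python
import hashlib
import time


def digest_sizes():
    """Digest length in bytes for each algorithm discussed above."""
    return {name: hashlib.new(name).digest_size for name in ("md5", "sha1", "sha256")}


def throughput_mb_per_s(name, block_size, total_bytes=4 * 1024 * 1024):
    """Hash `total_bytes` of zeros in blocks of `block_size` and report MB/s."""
    block = bytes(block_size)
    rounds = max(1, total_bytes // block_size)
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(name, block).digest()
    elapsed = time.perf_counter() - start
    return (rounds * block_size) / elapsed / 1e6


sizes = digest_sizes()
if __name__ == "__main__":
    for name, size in sizes.items():
        rate = throughput_mb_per_s(name, 8192)
        print(f"{name}: {size}-byte digest, about {rate:.1f} MB/s on 8192-byte blocks")
```

Small block sizes are dominated by per-call overhead, which is consistent with the lower throughput shown for 16-byte blocks.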


2.3.5 File Target Detection with Hashing 

One use of cryptographic hashes is to create signatures for content so it can be quickly
identified. Cryptographic signatures have several advantages. They are lightweight
and fast. They also allow the investigator to search for contraband without possessing illegal
files themselves. Signatures created from entire files also have some limitations. One
limitation is that if even one byte changes, the
entire signature will change, and a slight variation might not be found. This is especially
problematic for certain file formats, including Microsoft Word documents, which update
time stamps when the document is saved. This change is transparent to the user.

2.3.6 Approximate Matching

Approximate matching is defined in the NIST Special Publication 800-168 as “a generic term 
describing any technique designed to identify similarities between two digital artifacts” [23]. 
An artifact is defined as a byte sequence of arbitrary length that exists in a file. Whereas 
any cryptographic hash that is used to compare the two artifacts will only detect a full 
match or no match at all, approximate matching produces a result that indicates the degree 
of similarity between two digital artifacts. 

File fragments are pieces of files [24]. Hard drives, cell phones, network traffic files, and
other digital media contain a lot of data that has to be processed. By splitting this data into
pieces, these pieces can be placed into a database or data store. These pieces of files can
also provide insight as to where the data is in relation to the drive, network traffic, or other
media used.

These pieces of files are then hashed and stored in a database. The fragment size
used for these hashes is important because it determines how the files are divided. By having these files in
pieces, the hashes can be used to try to find pieces of files on systems or in internet traffic.
Our searches can move faster with fragment hashes because we can quickly determine if the
hash of a piece of a file is present in a database or data store.
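
A minimal sketch of fragment hashing follows, assuming fixed-size, aligned fragments and a plain dictionary as the data store. The fragment size of 64 bytes, the file name, and the function names are hypothetical choices made for the example.

```python
import hashlib


def fragment_hashes(data, size):
    """Return (offset, MD5 hex digest) for each full fixed-size fragment."""
    return [
        (off, hashlib.md5(data[off:off + size]).hexdigest())
        for off in range(0, len(data) - size + 1, size)
    ]


def build_store(files, size):
    """Map each fragment hash to the (file name, offset) pairs that produced it."""
    store = {}
    for name, data in files.items():
        for off, digest in fragment_hashes(data, size):
            store.setdefault(digest, []).append((name, off))
    return store


# 1024 bytes of sample "target" data in which every fragment is distinct.
original = b"".join(i.to_bytes(2, "big") for i in range(512))
store = build_store({"target.bin": original}, size=64)

altered = bytearray(original)
altered[100] ^= 0xFF                      # change a single byte
found = [store.get(d) for _, d in fragment_hashes(bytes(altered), 64)]
matched = sum(1 for sources in found if sources)

whole_file_changed = hashlib.md5(original).digest() != hashlib.md5(bytes(altered)).digest()
```

After the one-byte change, the whole-file hash no longer matches, but 15 of the 16 fragments are still found, and each hit reports the file and offset it came from.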


Rabin Fingerprinting 

In 1981, Michael Rabin published his idea for fingerprinting files using random polynomials.
Rabin states that he will “use irreducible polynomials to ‘fingerprint’ files so that any
unauthorized change will be detected with a very high probability” [25]. Rabin’s fingerprinting
algorithm offers an efficient technique for dividing a file into variable-sized blocks.
The block boundaries are calculated using a method of division by random polynomials,
such that the block boundaries remain the same even if material is inserted or deleted in the
file. This allows fingerprints to match even if modifications have caused the alignment to
change.
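
The boundary behavior described above can be illustrated with a simplified sketch. Note that it does not implement Rabin’s irreducible-polynomial construction: it substitutes a simple polynomial hash over a fixed window, recomputed at every position for clarity rather than updated incrementally. The window size, mask, and multiplier are arbitrary choices of ours.

```python
import random

WINDOW = 16   # bytes of context that decide whether a boundary falls here
MASK = 0x3F   # boundary when the low 6 bits are zero (about 1 in 64 positions)


def window_hash(window):
    """Simple polynomial hash of one window (a stand-in for a Rabin fingerprint)."""
    h = 0
    for b in window:
        h = (h * 263 + b) & 0xFFFFFFFF
    return h


def chunk(data):
    """Split data wherever the hash of the trailing WINDOW bytes hits the mask."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if window_hash(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
modified = b"X" + original                 # insert one byte at the front

orig_chunks = chunk(original)
mod_chunks = chunk(modified)
shared = set(orig_chunks) & set(mod_chunks)
```

Because each boundary depends only on the bytes in its window, the inserted byte disturbs only the first block; every later block of the original reappears unchanged in the modified data, which fixed-offset blocks would not achieve.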


Shingling

Shingling is the process by which we hashed overlapping blocks of data. In this experiment,
we hashed a block of data, moved one byte over, then hashed the next overlapping block of
data. Shingling is commonly used when trying to find similarities between two documents.
This term was first mentioned in Broder’s “Syntactic Clustering of the Web” in 1997 [26].
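
A short sketch of byte-level shingling follows. The shingle width of 4 bytes and the function names are our assumptions, and the similarity score is the Jaccard ratio of the two shingle sets, applied here to bytes rather than to words.

```python
import hashlib


def shingle_hashes(data, width):
    """MD5 every overlapping `width`-byte block, advancing one byte at a time."""
    return {
        hashlib.md5(data[i:i + width]).digest()
        for i in range(len(data) - width + 1)
    }


def resemblance(a, b, width):
    """Jaccard similarity of the two shingle sets (1.0 means identical sets)."""
    sa, sb = shingle_hashes(a, width), shingle_hashes(b, width)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


doc_a = b"the quick brown fox jumps over the lazy dog"
doc_b = b"the quick brown fox leaps over the lazy dog"
score = resemblance(doc_a, doc_b, width=4)
```

Two documents that differ in one word share most of their shingles, so the score falls between 0 and 1 instead of dropping to a complete mismatch as a whole-file hash comparison would.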


sdhash 

sdhash is a “tool that allows two arbitrary blobs of data to be compared for similarity based
on common strings of binary data” [27]. When building hashes, sdhash uses a sliding
hash, which is similar to Broder’s shingling method. However, only a limited number of shingles
are selected for the fingerprints. These shingles are selected based on statistical analysis
that attempts to guess which shingles are most likely to be correlated with the data that
is being fingerprinted.


Hash-based Carving 

Hash-based carving is a technique that is used to detect the presence of target data on 
secondary media storage by evaluating the hashes of individual data blocks. It can locate 
known files by “recognizing a target file on a piece of searched media by hashing same-sized 
blocks of data from both the file and the media and looking for hash matches” [28]. This 
work provides inspiration for the databases that are used later in the thesis. These databases
contain hashes of data blocks that are used to look for pieces of target data.


2.4 Tools 

This section discusses a few of the tools that we used in this thesis. 


2.4.1 bulk_extractor 

Bulk_extractor is a forensic analysis tool designed for directly extracting artifacts of forensic 
value from storage media or disk images. The program is capable of looking at hard drive 
images, directories that contain many files, network traffic, cell phone images, and other 
types of digital media [29]. bulk_extractor ignores file system structure and runs directly 
on the underlying data blocks. This strategy allows it to process data on multiple cores at 
once, speeding up the evidence extraction process.

There are different scanners that can be used with bulk_extractor. Scanners include e-mail,
PDF, RAR, and hashdb. Each scanner looks for a specific kind of data in the input that it
is given. We make use of the hashdb scanner in this thesis. See Section 2.4.2.

2.4.2 hashdb 

Hashdb is a key-value store for storing block hashes. Its main capabilities include creating 
hash databases of MD5 block hashes, importing block hash values, scanning a hash database 
for matching hash values, and providing source information for hash values [30]. 

Hashdb is designed to support block hashing instead of full file hashing. This is useful 
because there are instances where you might only find fragments of a file instead of the 
entire thing. A major design goal that motivated the creation of hashdb was to develop a 
key value store that could store billions of hashes and respond to queries quickly enough to 
make hash-based carving practical. Hashdb can also be used to analyze network traffic and 
embedded content in other documents. 

There are hashdb libraries for the Python and C++ programming languages. These can be
used in programs to use hashdb instead of using the command line.

CHAPTER 3:
Related Work 


Much of the prior work in preventing data exfiltration has focused on detecting attempts to
transfer documents over the network. We briefly describe three broad categories of data
exfiltration strategies: naive exfiltration, obfuscated exfiltration, and encrypted exfiltration.
We then discuss three strategies for preventing document exfiltration: behavior-based
approaches, content-based methods, and air-gapping. These concepts, while different, look
to achieve the same goal. Behavior-based approaches look at rule-based systems, anomaly
detection, and a combination of rule-based systems and anomaly detection. Content-based
methods dive deep into the data itself to try to find evidence of exfiltration. An air-gap
is a form of isolated network that prevents connected devices from accessing
external networks, such as the internet.


3.1 Exfiltration Strategies 

An adversary exfiltrating data can use a variety of techniques. Though an exhaustive survey
of these techniques is outside the scope of this thesis, we discuss three basic categories
of strategy. We call the most basic class of strategies “naive exfiltration.” This refers
to exfiltration attempts that do not attempt to prevent detection. Obfuscated exfiltration
attempts to send data that has an altered form in order to circumvent simple matching
schemes. Finally, as the name suggests, encrypted exfiltration uses encryption to avoid
detection and prevent third-party access to the data.

3.1.1 Naive Exfiltration 

This thesis focused entirely on what we refer to as “naive exfiltration.” Data was sent over
HTTP, unobfuscated and unencrypted. Sending data over an unencrypted FTP connection is
another example of this kind of exfiltration. Any traffic sent this way could be
captured and read by the network administrators. Because the only alteration to the
data is caused by the process of being sent over the network itself, it is possible to predict
differences between the on-disk format and the format in the traffic based on knowledge of the
network.

3.1.2 Obfuscated Exfiltration

Obfuscated exfiltrated data can be a little more difficult to read. By flipping a few bytes in
a file, or by using a find and replace operation to swap similar-looking characters (O to 0, for
example), the exfiltrator can completely change the hash. While the file would still read the
same, the hash would be different, and the file might not be found.

Another example of obfuscated data is changing the file extension to defeat rudimentary 
checks for unauthorized file types. Finally, compressing a file is another simple way to 
obfuscate its content. When a file is compressed, that will change the hash of the file and 
most or all fragments of the file. The technique we test would detect data where the file 
extension has been changed, and may detect some data where a find and replace operation 
has been run. However, it is unlikely to detect compressed versions of the content unless 
the compression was ineffective. 
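
The effect of these three obfuscations on fragment matching can be explored with a small experiment. This is a toy sketch under our own assumptions (an 8-byte fragment size, synthetic records, aligned fragments only), not a reproduction of the tests in this thesis.

```python
import hashlib
import zlib

BLOCK = 8  # hypothetical fragment size, kept tiny for the example


def block_hashes(data):
    """Set of MD5 digests of every aligned BLOCK-byte fragment."""
    return {
        hashlib.md5(data[i:i + BLOCK]).digest()
        for i in range(0, len(data) - BLOCK + 1, BLOCK)
    }


def matched_fraction(sent, target_hashes):
    """Fraction of the sent data's fragments that appear in the target set."""
    sent_hashes = block_hashes(sent)
    return len(sent_hashes & target_hashes) / len(sent_hashes)


# 256 distinct 8-byte records; every fourth record ends in the letter "O".
target = b"".join(
    f"{i:07d}".encode() + (b"O" if i % 4 == 0 else b"x") for i in range(256)
)
target_hashes = block_hashes(target)

renamed = target                          # a new file extension leaves the bytes unchanged
replaced = target.replace(b"O", b"0")     # find and replace: letter O to digit 0
compressed = zlib.compress(target)

results = {
    "renamed": matched_fraction(renamed, target_hashes),
    "replaced": matched_fraction(replaced, target_hashes),
    "compressed": matched_fraction(compressed, target_hashes),
}
```

Renaming leaves every fragment intact, and the find and replace operation spoils only the fragments that contained the replaced character. Compression rewrites the byte stream, so few if any fragments of the compressed data should be expected to match.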

3.1.3 Encrypted Exfiltration 

Exfiltrating encrypted data is an effective way for an adversary to exfiltrate data without having
it be seen. Sending files over HTTPS, SCP, or FTPS, or even encrypting a zip file,
significantly increases the difficulty of detecting the file. The only way to decrypt these
files would be to have the encryption key, or have the ability to break SSL encryption for
data going over HTTPS, for example. Once encryption is broken, the files could be
run against the database to determine whether or not they should be allowed off the network.


3.2 Behavior-based Approaches for Preventing Document 
Exfiltration 

Behavior-based approaches look at rule-based systems, anomaly detection, and a combination
of rule-based systems and anomaly detection. Rule-based systems, such as firewalls
and Snort, use rules to observe packet data and make a determination on what to do with
a packet if it meets certain criteria. Anomaly detection systems, such as Bro, attempt
to gauge what normal and abnormal traffic is while observing traffic on a network. The
combination of rule-based systems and anomaly detection systems can be effective if implemented
properly. Giani et al. used both rules and heuristics to look for data exfiltration
off of their network [15].

3.2.1 Rule-based Systems

Rule-based systems follow a certain set of rules that are created by a system administrator, 
or the manager of the system. These systems will look to see if the data passing through 
them matches a certain condition. If the condition is met, then the system will perform a 
previously specified action. Otherwise, the data flow will continue as normal. 


Firewalls 

The first firewall was created in 1983, at Stanford by Brian Reid [31]. The purpose of this 
firewall was to safely connect to ARPAnet. In 1988, Reid wrote a procedure manual for the 
firewall he created in 1983 [31]. In 1990, Bill Cheswick at AT&T Bell Labs began working 
on a more advanced type of internet gateway. Cheswick’s work focused on FTP and telnet 
services [32]. Earlier firewalls were stateless, meaning that packets passing through the 
firewall had to meet certain requirements and pass through a ruleset. Stateful firewalls came 
a few years later. Stateful firewalls collect data about each packet. Eventually, the firewall
will make a determination about the packet and the state associated with it. Because stateful
firewalls could potentially be compromised by denial of service attacks, a third generation
of firewall was developed. This new type of firewall could determine
if there was a port or service that was being constantly attacked. This firewall could look at
a packet and see if it was acting like normal traffic for the type of service that was running
on that port. Currently, the most common type of firewall is a Next Generation Firewall
(NGFW). This firewall does more deep packet inspection than the third generation firewall.
It will still look at application layer data, but has many more components, such as intrusion
detection systems and web application firewalls [31].


Snort 

Snort is an intrusion detection and prevention system. It has the capability to both detect
and prevent exfiltration. It has three main modes: sniffer mode, packet logger mode, and
network intrusion detection system mode. Sniffer mode reads packets from the network and
displays them. Packet logger mode logs the packets that Snort sniffs. Network intrusion
detection system mode, or NIDS mode, applies a ruleset and attempts to block traffic when
something in a packet’s data matches a rule. NIDS mode can also act like a firewall, dropping
packets without logging them, depending on how the rules and settings are configured on the
system. Snort uses signature-based detection, in which it looks for packets with a certain IP
address, port, or content. If a packet matches something in the ruleset, a decision is made
on what to do with it.

Snort signatures can look for arbitrary content in network traffic. They can be customized
with rules that look for certain byte patterns. For example, suppose that PDF files should
not leave the network. A Snort signature can be written to look for byte patterns that
appear in all PDF files. This can be useful when trying to identify certain files in network
traffic.
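
The idea behind a byte-pattern signature can be illustrated outside of Snort. The following Python sketch is not a Snort rule and does not use Snort's rule syntax; it only shows the underlying step of searching a payload for a fixed byte sequence. The `%PDF-` marker is the standard start of a PDF file, while the function name and the sample payloads are made up for the example.

```python
PDF_MAGIC = b"%PDF-"  # byte sequence that begins a PDF file


def payload_matches_signature(payload: bytes, signature: bytes = PDF_MAGIC) -> bool:
    """Return True if the signature occurs anywhere in the payload."""
    return signature in payload


# Hypothetical payloads: one HTTP response carrying a PDF, one carrying plain text.
http_response = b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n%PDF-1.4\n..."
plain_text = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
```

A signature of this kind can only fire on the packet that carries the matching bytes, which for a file header is the packet containing the start of the file.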

Snort allows a ruleset to be implemented for each instance. Each Snort rule in this ruleset is
customizable. In addition, Snort offers a set of default rules that look for activity on
non-standard ports, ports that should be closed, or ports that are known to carry little
traffic. The rulesets can be tailored to a network’s setup.

3.2.2 Anomaly Detection 

Anomaly detection focuses on observing what type of data, how much data, and when the
data is transferred on the network. Anomaly detection systems focus on the traffic itself,
while rule-based systems raise alerts based on the content they find. Anomaly detection
looks for oddities and strange patterns in the traffic. One example of a network anomaly
detection system is Bro.

Overview of Bro 

The goal of Bro is to analyze a large amount of traffic to find anomalies. Bro uses its own
scripting language to run its system. This language can also be used to customize what data
is collected and where it is stored. The system then collects logs that can be parsed later.

How Bro Works 

Bro starts by capturing network traffic with libpcap. When capturing network traffic, a 
filter can be applied to only look for certain protocols. Paxson et al. only looked for traffic 
on Finger, FTP, or telnet. After the traffic is captured, it is passed on to the event
engine. This engine will ensure that all packet headers are properly assembled before
continuing. Once the engine checks the packets, an instance will be created if something
does not appear to be correct in the packet header. If the packet has a proper header, the
engine will then associate the packet with the correct TCP or UDP flow. After the packet
is done being processed by the engine, the engine will then look to see if any events were
generated. The events will then be processed by the engine with a policy script that will
determine what happens with each individual event [33].


Advantages of Bro 

Bro is helpful because it tries to look for anomalies in traffic, which could be something
like bursty traffic. When bursty traffic is seen, that could be an indicator that a large file
or multiple files are being exfiltrated off of the network. This is considered an anomaly.
Bro would also look at the amount of traffic to see if there is less than normal data flowing
through the network. If there is not enough traffic flowing, that could indicate that something
is wrong with the network. That is also an anomaly. These cases will raise a red flag in Bro
and will alert an administrator as to what is going on. It is then possible to use the Bro logs
to try and find what is going on and why Bro was logging those items. By determining the
IP that sent the data, the network administrator can then find who the IP belongs to, as long
as the network is documented thoroughly. The administrator can then determine what went
wrong and how serious the danger was.
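
As a rough illustration of flagging unusually high or unusually low traffic volume, the sketch below compares the byte count of each interval with a baseline mean and standard deviation. It is written in Python rather than in Bro's own language and is not how Bro itself detects anomalies; the threshold of three standard deviations, the sample numbers, and all names are assumptions made for the example.

```python
from statistics import mean, stdev
from typing import List


def find_volume_anomalies(baseline: List[int], observed: List[int], k: float = 3.0) -> List[int]:
    """Return indices of observed intervals whose byte count lies more than
    k standard deviations above or below the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [i for i, count in enumerate(observed) if abs(count - mu) > k * sigma]


# Hypothetical bytes-per-interval measurements.
baseline_bytes = [1000, 1100, 950, 1050, 1020, 980]  # normal traffic
observed_bytes = [1010, 9000, 990, 0]                # a burst, then an outage
anomalous_intervals = find_volume_anomalies(baseline_bytes, observed_bytes)
```

Both the burst and the silent interval are flagged, mirroring the two kinds of anomaly described above.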

Bro is also helpful in the amount of data that it logs. Bro will look at DNS traffic, HTTP
requests, and any connections attempted over FTP, SSH and other major services. These logs
will include the times and IPs that made the connections. These can be useful when trying
to identify a potential intrusion or compromise.


3.2.3 Combined Approaches 

A “combined approach” means what its name implies: a combination of multiple approaches
to ensure data is not exfiltrated off of networks. This approach uses both rule-based
systems and anomaly detection systems. Using a combined approach provides more
opportunities for a system to find potential problems that might arise on a network.

Giani’s Work

Giani et al. analyzed the different ways that data could be exfiltrated, and whether or not
the speeds that were observed would be sufficient. They looked at methods like CD/DVD
burning, using a T1 internet line, a T3 internet line, cable or DSL line, and printing out
the data [15]. The T3 internet line was the quickest at exfiltrating data. However, the
exfiltration also had to remain covert. A middle ground had to be found in which the most
data could be exfiltrated without being caught.

This is considered a combined approach because the authors looked at many different factors
when considering how the data should be exfiltrated and whether or not the exfiltration would
be recognized. They looked at the statistics to determine which method would be the fastest,
then at which method would best avoid signature-based detection systems and anomaly
detection systems.


3.3 Content-based Methods for Preventing Document Exfiltration

Content-based methods dive deep into the data itself to try and find evidence of exfiltration. 
These methods use content-based signatures, packet reassembly, and approximate matching. 
Content-based signatures try to identify certain byte patterns and will trigger signatures 
when those byte patterns are found. Packet reassembly will be used by some firewalls to 
reassemble the data before it is delivered to the destination. This is useful when trying to 
find potential exfiltrated files that should be found as a 100% match. 


3.3.1 Content Inspection with Exact Matching 

Content inspection with exact matching reports a hit only when a piece of data being searched
for is found exactly. That is helpful when needing to identify whether or not a specific piece
of data is present. This concept does, however, have a drawback. If the file that is being
searched for has been altered, even by one byte, the system will no longer find it.
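
The drawback of exact matching can be demonstrated with a whole-file hash. In the minimal Python sketch below, the sample data, the set `known_hashes`, and the function `is_known_file` are hypothetical; MD5 is used only because it is the hash used elsewhere in this thesis, and any cryptographic hash would behave the same way.

```python
import hashlib

# A made-up "sensitive file" and a copy with a single bit flipped in the first byte.
original = b"CONFIDENTIAL: quarterly results" * 100
modified = bytearray(original)
modified[0] ^= 0x01
altered = bytes(modified)

known_hashes = {hashlib.md5(original).hexdigest()}


def is_known_file(data: bytes) -> bool:
    """Exact matching: the whole-file hash must be in the set of known hashes."""
    return hashlib.md5(data).hexdigest() in known_hashes
```

The altered copy has the same length and differs in only one bit, yet its hash no longer matches, so an exact-match system would miss it.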

Content-based Signatures

An example where content-based signatures can be used is with Snort. As mentioned earlier
in Section 3.2.1, Snort allows a ruleset to be implemented for each instance. An example
ruleset might contain content rules that look for specific byte sequences, such as the
starting bytes of a RAR file. Every RAR file they look for has the exact same header. Once
the header is found, that packet will not leave their network. Because their example setup
stored important data in RAR files, this made it easier to detect whether or not such a file
was going to leave the network. They decided not to let any RAR files pass through the
network. Whenever a piece of a RAR file was seen, Snort dropped the packet.

This prevents insider data exfiltration with RAR files only. Because the same four bytes
appear in every RAR file header, Snort can look for the same pattern each time. Only one
signature has to be built. The RAR files will never be permitted to leave the network
because of the signature that is in place.


Packet Reassembly 

Glavlit, developed in 2006, provides reassembly and inspection of data at the application
layer, but requires all authorized transmissions to be vetted manually in advance [34]. The
guard overseeing the traffic that runs through Glavlit divides each file seen in network
traffic into 1024-byte chunks and generates a hash for each of them using SHA-1. Once the
hashes are created for the fragments of the file, the guard verifies that the file is known
good and allows the file to pass. If the guard determines that the file is not known to be
good, then the file is not permitted to leave the network. In their experiment, the guard
received 3000 requests per second and was able to process the packets as quickly as the
server delivered them [34]. Throughput was slightly over 11 Mbps when the file size was
above 10 KB.
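
The chunk-and-hash step described above can be sketched as follows. This is a simplified illustration and not Glavlit's actual implementation: it assumes a flat set of vetted SHA-1 chunk hashes, and the function names `chunk_hashes`, `vet`, and `may_leave_network` are hypothetical.

```python
import hashlib
from typing import Iterable, List, Set

CHUNK_SIZE = 1024  # bytes per chunk, as described in the text


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Split data into consecutive chunks and return the SHA-1 hex digest of each."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def vet(files: Iterable[bytes]) -> Set[str]:
    """Build the set of chunk hashes for files approved in advance."""
    approved: Set[str] = set()
    for data in files:
        approved.update(chunk_hashes(data))
    return approved


def may_leave_network(data: bytes, approved: Set[str]) -> bool:
    """Allow a transfer only if every one of its chunks was vetted."""
    return all(h in approved for h in chunk_hashes(data))


vetted_file = bytes(range(256)) * 10  # made-up 2560-byte file, giving 3 chunks
approved = vet([vetted_file])
```

Because the approach is a whitelist, anything that was not vetted in advance is refused, which is the manual vetting cost noted above.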

In this instance, the guard was a machine that was able to automatically analyze the packets
in real time. Initially, there were some concerns about the speed of the traffic passing
through the guard. However, at the end of the experiment, the speed of traffic that flowed
through the guard equaled that of traffic that was not sent through the guard.

3.3.2 Detecting Exfiltration with Approximate Matching

When approximate matching is used to find potentially exfiltrated data, hashes of the observed
data are looked up in a database or data store. Hashes are small, easy to store, and easy to
compute, and in the case of this thesis they are not used for security purposes. Hashes are
reliable, which helps when comparing them to find files that might have left the network.


Finding Targeted Data in Secondary Devices 

Roussev’s work focused on using the National Software Reference Library, or NSRL, 
provided by NIST to attempt to compare the files found with those they attempted to 
exfiltrate with their experiment. Foster used sector hashes of secondary media drives to 
correlate the hashes of similar sectors. Garfinkel and McCarrin worked on hash-based 
carving, which detects the presence of target data on secondary media by looking at the 
individual data blocks. 

In 2009, Roussev discussed a technique that is useful in trying to identify files: hashing.
A cryptographic hash function is applied to the files on a file system, or to the file system
itself [35]. The outputs of the cryptographic hash function are hashes. These hashes can
then be compared against the National Software Reference Library, or NSRL. This library
covers many known operating system and file system specific files. Many of the hashes
obtained might match those in the NSRL. Those hashes are filtered out, leaving the hashes
that do not match anything in the NSRL database. The remaining hashes could potentially
match other files, such as Word documents and PDFs from different computers. By hashing
the entire file, there is an opportunity to find whether a file is on multiple systems.

In 2012, Foster attempted to find documents from a data corpus on a system [36]. Her work 
focused on finding documents by looking at sector hashes. When generating hashes for 
each of the sectors used in the experiment, databases were generated to compare the hashes 
with other data in different corpora to see if there were any similarities between the data. 
The govdocs corpus, as well as a few other data sets, were used. If the sector hashes did not 
match others in the data set, that indicated a unique sector of data might have been found. 
Otherwise, the sectors that appeared multiple times were ignored. 

To run her experiment, Foster had to create multiple databases using different programs.
Five databases were built, one for each program, to determine which database would be used.
SQLite achieved the best query rate, measured in transactions per second, for databases
below 100 million rows. SQLite also performed best in terms of query speed. These
comparisons showed which program was best suited to building the databases that must be
queried to find the data.

Foster’s work is important because it handles databases that contain many rows of hashes.
These hashes were used to find documents that existed in a real data corpus. The rows of
hashes could be queried against a known data set of files to determine what matches and
what does not. The uniqueness of the data determined whether it was interesting: the more
unique the data, the more interesting it becomes. The large number of hashes stored in
these databases is something that we also needed for this thesis.

As mentioned in Section 2.3.6, “hash-based carving” was developed to detect the presence
of target data on secondary storage media by evaluating the hashes of individual data
blocks [28]. Hash-based carving finds target data by dividing each targeted file into
sector-sized fragments and storing the hashes of these fragments in a database (Garfinkel
and McCarrin use the hashdb database, which was built for this purpose [30]). The stored
hashes can be compared to the hashes of sectors on hard disks to determine if a file is or
was present, even if the fragments are not stored contiguously. Our scheme will apply this
approach to network traffic.
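
The idea of hash-based carving can be sketched with a plain Python dictionary standing in for hashdb. This is an illustration under simplifying assumptions, not the hashdb API: the 512-byte sector size, the function names, and the synthetic file and disk image are all hypothetical.

```python
import hashlib
from typing import Dict, List, Set, Tuple

SECTOR_SIZE = 512  # assumed sector size in bytes


def sector_hashes(data: bytes) -> List[str]:
    """MD5 of each full sector-sized block of data."""
    return [
        hashlib.md5(data[i:i + SECTOR_SIZE]).hexdigest()
        for i in range(0, len(data) - SECTOR_SIZE + 1, SECTOR_SIZE)
    ]


def build_database(target_files: Dict[str, bytes]) -> Dict[str, Set[str]]:
    """Map each block hash to the names of the target files that contain it."""
    db: Dict[str, Set[str]] = {}
    for name, data in target_files.items():
        for h in sector_hashes(data):
            db.setdefault(h, set()).add(name)
    return db


def scan_media(image: bytes, db: Dict[str, Set[str]]) -> List[Tuple[int, Set[str]]]:
    """Return (sector index, matching files) for every sector whose hash is in db."""
    return [(i, db[h]) for i, h in enumerate(sector_hashes(image)) if h in db]


# Made-up target file with four distinct sectors.
target = b"A" * 512 + b"B" * 512 + b"C" * 512 + b"D" * 512
db = build_database({"secret.doc": target})

# Made-up disk image holding two of those sectors, out of order and not adjacent.
image = b"\x00" * 512 + b"C" * 512 + b"\x00" * 1024 + b"A" * 512
hits = scan_media(image, db)
```

The two target sectors are found even though they are stored out of order and separated by unrelated data, which is the property that makes the approach attractive for fragmented content.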

There are two other software applications that are involved with hashes: ssdeep and sdhash.
ssdeep focuses on fuzzy hashing, which looks at trying to find similar sequences of bytes,
but not necessarily in a similar order. sdhash is a tool that focuses on comparing two
blobs of data to look for strings in common between the two blobs. Both of these software
applications, while important, do not pertain to this thesis.


Application to Detecting Document Exfiltration 

Two systems have been created within the past few years that claim to help with detecting
data exfiltration: the PROOFS system and a max-hashing system. The PROOFS system
looks at signatures for digital objects, and stores digital objects from many machines on a
network on this one system. The max-hashing system proposed by Larbanet et al. observed
traffic over a 40/100GbE Ethernet card to try and detect MP4 files specifically going over a
network.

PROOFS, or Proactive Object Fingerprinting and Storage, is a system developed by Shields
et al. to create and store the object signatures for digital objects [37]. This system is
connected to many machines on a network. Anything that is done on any of the systems will
be logged with an object signature on the PROOFS system. This is to monitor and track
what each machine on the network does. This signature contains the metadata associated
with each action. The signature also contains multiple digital fingerprints that can be used
to search and track the data at a later point in time.

In order to effectively keep track of the data on the network, they created a dictionary
containing fingerprints of the objects [37]. In order to create the fingerprints, tokens
must be pulled from documents. Shields et al. used Oracle’s Outside In program to extract
the text from Word documents and PDF files. The text is then filtered, with common stop words,
symbols and numbers being removed. The processing is done by creating tokens based on
the types of documents that are being fingerprinted. The compressed fingerprints take
up 375 bytes worth of data [37]. Once the fingerprints are created, the signatures are
created as well. The signatures are a combination of the fingerprints that are stored and
the metadata that is sent from the system. File writes, file creations and other events
determine what the metadata will contain. To run their experiments, they used the
Enron data set. This data set contained information and documents that they could use
to create dictionaries to determine if this data could be found. The trial run showed a
success rate of about 96%, meaning that they were able to successfully identify 96% of the
files that they knew were inside the dictionary.

There are some limitations with the system. The system is probabilistic, meaning that it
makes an educated guess as to whether or not the file is there. Shields et al. explain
that other forensics tools are needed to ensure that all forensic evidence is used. The
authors also mention that this tool is not meant to replace any tool, but to be used in
addition to other tools [37].

Within the past year, Larbanet et al. created a data exfiltration detection program
that looks at traffic flowing over a 40/100GbE Ethernet card [38]. Their goal was
to find MP4 files in their network traffic at high speed. To accomplish this, they computed
the fingerprints of all the files they wanted to check for in the network traffic. They then
identified TCP and UDP flows. By checking the IP addresses and ports in the traffic, they
were able to match packets to flows that they could use later. This would give them a better
idea of how many flows went back and forth between two IP addresses. Once IP addresses and
ports were determined to belong to the same flow, the CPU was alerted to start
searching through the fingerprints of the data to try and find matches.
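
Grouping packets into flows by their addresses and ports can be sketched as follows. This is a generic Python illustration and not the implementation used by Larbanet et al.; the `Packet` fields, the `flow_key` function, and the sample packets are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str  # "TCP" or "UDP"
    payload: bytes


FlowKey = Tuple[str, Tuple[str, int], Tuple[str, int]]


def flow_key(pkt: Packet) -> FlowKey:
    """Direction-independent key: both directions of a conversation map to one flow."""
    endpoint_a = (pkt.src_ip, pkt.src_port)
    endpoint_b = (pkt.dst_ip, pkt.dst_port)
    low, high = sorted([endpoint_a, endpoint_b])
    return (pkt.protocol, low, high)


def group_by_flow(packets: List[Packet]) -> Dict[FlowKey, List[Packet]]:
    """Collect packets that share protocol, IP addresses, and ports."""
    flows: Dict[FlowKey, List[Packet]] = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return dict(flows)


# Made-up packets: two directions of one TCP conversation, plus one UDP packet.
packets = [
    Packet("10.0.0.5", 50000, "10.0.0.9", 21, "TCP", b"a"),
    Packet("10.0.0.9", 21, "10.0.0.5", 50000, "TCP", b"b"),
    Packet("10.0.0.5", 50001, "10.0.0.9", 53, "UDP", b"c"),
]
flows = group_by_flow(packets)
```

Sorting the two endpoints makes the key the same for both directions, so traffic going back and forth between two hosts is counted as one flow.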

The max-hashing system proposed by Larbanet et al. in 2015 approaches the problem by
computing the hash value of each fixed-size, small window in a data set, then keeping only
the one with the maximum value [38]. In their work, Larbanet et al. used only FTP
connections to download their files. They collected the traffic from those FTP connections
on their own Linux servers to see if they were truly able to find their files going across
their network connections. For memory transferred from the Direct Memory Access to the GPU
memory, the rate was about 50 Gbps. The speed depended on the transmitted block size [38].
Fingerprint generation was tested on randomly chosen files: 768 megabytes worth of files were
chosen, with “four fingerprints per 1536 B blocks” [38]. The kernel was able to process at a
rate of 119.9 Gbps and generate 40 million fingerprints per second.
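
The core of max-hashing, as summarized above, is to hash every small window of a block and keep only the largest value. The Python sketch below illustrates that idea only; the 16-byte window, the use of truncated MD5 as the window hash, and the function names are assumptions for the example and do not reflect the hash function or hardware implementation used in [38].

```python
import hashlib

WINDOW_SIZE = 16  # assumed window length in bytes


def window_hash(window: bytes) -> int:
    """Hash one window to an integer (first 8 bytes of MD5, chosen for illustration)."""
    return int.from_bytes(hashlib.md5(window).digest()[:8], "big")


def max_hash(block: bytes, window_size: int = WINDOW_SIZE) -> int:
    """Fingerprint of a block: the maximum hash over all of its sliding windows."""
    if len(block) < window_size:
        raise ValueError("block is shorter than one window")
    return max(
        window_hash(block[i:i + window_size])
        for i in range(len(block) - window_size + 1)
    )


block = bytes(range(256))                   # made-up reference block
fingerprint = max_hash(block)
stream = b"prefix" + block + b"suffix"      # the same block embedded in other data
stream_hashes = {
    window_hash(stream[i:i + WINDOW_SIZE])
    for i in range(len(stream) - WINDOW_SIZE + 1)
}
```

Because the fingerprint is the hash of some window inside the block, it reappears among the window hashes of any stream that contains the block, regardless of where the block starts in that stream.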

This thesis is different from their project in one key aspect: this thesis focuses on trying to
identify fragments of files within HTTP traffic. We take the hashes of the payloads
of HTTP traffic and look for these hashes in the hashdb databases that we create. This is
similar to the fingerprinting technique that Larbanet et al. use in their experiment.


3.4 Air-Gapping 

As mentioned earlier, an air-gap is a form of isolated network that prevents a computer from 
accessing any outside networks [39]. By having computers strictly isolated from networks 
and other computers, this strategy forces exfiltrators to rely on out-of-band channels such 
as external media. A USB flash drive, CD, or floppy disk are examples of what could 
be used to get new data to an air-gapped machine. This makes exfiltrating data off of
these machines more difficult, by forcing malicious actors to have physical access to the
air-gapped computer.

CHAPTER 4:
Methodology 


This chapter covers our methods and techniques. We conducted several experiments to 
detect data exfiltration by comparing the hashes of packets’ payloads to a database of 
blacklisted file fragments. The Govdocs data set was used to simulate the blacklisted 
content. The files from Govdocs were hosted on a server, and downloaded over HTTP. In 
addition to the blacklisted files, there was also noise data. This data does not belong to the 
blacklist. However, this data is important, as it is supposed to represent actual web traffic 
that could produce false positives when trying to find blacklisted files. 


4.1 Hardware 

We built our database using a 64-core machine with 512GB of RAM. In addition, there was
over 14TB of available space on a network share to use to build the databases. The web
server was capable of holding all Govdocs files and allowed for the use of HTTP over port
80. The Mac laptop had 16GB of RAM and a 512GB SSD. Figure 4.4 shows a diagram of
the hardware.


4.2 Data Sets 

To simulate data leaving the network, we used two subsets of the Govdocs data set, which
we label Sample 1 and Sample 2. Both subsets were further divided into target data and
noise data. Target data is part of our blacklisted files: files that we hope to find when
scanning our traffic. Noise data represents authorized data that we expect to find within
our network traffic but that we are not looking for. This extraneous data allows us to
simulate an actual network.


4.2.1 Govdocs 

We used Govdocs as the data set for this thesis [20]. It was established as a file set for
forensics research that everyone can use. Govdocs contains approximately one million
files that are freely distributable. The one million files are divided into 1,000 separate
directories, each containing around 1,000 files. These files are all open source and can be
used by anyone. To build the corpus, Garfinkel et al. performed basic searches of the .gov
domain with Google and Yahoo in order to download the one million files [20]. These files
include Word documents, Excel spreadsheets, PDF files, text documents, HTML webpages,
and many other data types.


4.2.2 Govdocs Sample 1 

Sample 1 contained the first 100 files from Govdocs directory 002, which contains 990
files. These files were chosen simply because they were the first 100 files in the directory.
Out of the 100 files downloaded, the first 50 were selected as target data and stored
in our database; the second 50 files were noise data. Noise data consists of files that we
downloaded that we know are not a part of the target data set that we are looking for. Sample
1 contains 6.4 MB of target data and 13.6 MB of noise data, so the amount of target data is
about half that of the noise data. We chose 50 target files because this gave a small sample
that we could test quickly.

Table 4.1 shows statistics about the 100 different files in Sample 1. The columns indicate
how many files of each type were downloaded for the target data and the noise data. Because
only 100 files were used in Sample 1 and our selection method did not ensure a uniform
distribution across file types, there was not a wide variety of file types. Most of the target
data was HTML files (45 out of the 50). With respect to the noise data, 38 out of 50 files
were text documents.

The one UNK file that was found was actually a text document that was a court ruling, but 
which did not have a .txt extension. This document could have probably been interpreted as 
a text document, as it could be opened with a text editor. This file was interpreted as UNK 
when it became a part of Govdocs. 


Extension   Target   Noise   Extension   Target   Noise
pdf         2        2       txt         3        37
html        45       1       UNK         0        1
jpg         0        8       text        0        1

Table 4.1: The distribution of file extensions in the Govdocs Sample 1 corpus.
The Govdocs Sample 1 corpus contains mostly human readable documents
and images. Adobe PDF, HTML, text, and JPG files make up the dataset.

4.2.3 Govdocs Sample 2

Sample 2 contained one 990-file directory, directory 002, and one 991-file directory,
directory 013. We used the files from these directories to create our databases. The set of
990 files was our target data; the other 991 files were noise data. Sample 2 contains 387
MB of target data and 536 MB of noise data, so there is 149 MB more noise data than
target data.

Table 4.2 shows the distribution of file types among the 1,981 files in Govdocs Sample 2.
The most common types of files downloaded for the target and noise data were PDF files,
TXT documents and HTML webpages. Through our block level analysis, we found that most
files have extensions that match the file type, but some files do not. Many of the text
files contained HTML data, meaning that they could also be interpreted as webpages.


Extension   Target   Noise   Extension   Target   Noise
pdf         242      271     log         3        3
txt         160      128     wp          1        0
html        292      241     UNK         5        2
tex         2        1       jpg         97       81
gls         1        0       xml         10       12
text        3        3       rtf         0        1
f           0        1       doc         64       71
xls         21       26      ppt         54       53
gif         12       57      ps          10       13
dbase3      2        1       csv         6        14
gz          4        9       java        1        3


Table 4.2: The distribution of file extensions in the Govdocs Sample 2 corpus. 
The Govdocs Sample 2 corpus contains mostly human readable documents 
and images. Adobe PDF, Microsoft Office, HTML, text files and graphical 
image files make up the majority of the corpus. 


4.3 Byte Alignment 

When files are transferred over a network, they must be divided into fragments so that no
packet exceeds the network MTU. This can present problems for observers attempting
to monitor the network for unauthorized exfiltration because it is difficult to match small
pieces of content with whole blacklisted files. In addition, the first packet in every stream
contains an HTTP header. The size of the header can vary, so the location of the first byte
of a file being downloaded depends on the header size. Because the HTTP header sits at the
start of the traffic stream that downloads a file, it makes finding the start of the file data,
and therefore the fragments of files, more difficult.


HTTP/1.1 200 OK
Date: Tue, 24 May 2016 19:52:32 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Fri, 06 Feb 2009 01:52:52 GMT
ETag: "7fe0-46236475ee500"
Accept-Ranges: bytes
Content-Length: 32736
Vary: Accept-Encoding
Connection: close
Content-Type: text/html


This is an example of an HTTP header. We chose HTTP because it represents one reasonable 
channel for data exfiltration, even on restricted networks. 

We solve these problems by dividing our blacklisted files into fragments in advance, and 
matching these fragments to the fragments in the network traffic. To match the fragments 
correctly, we had to align the bytes of data with the bytes in the network packets. This would 
allow us to match the most hashes to a packet’s payload. Since the varying HTTP header 
size makes it difficult to predict where a fragment of payload will begin or end in the file 
being transmitted without actually parsing the header itself, we populated our database with 
all possible alignments in each blacklisted file. This is why a small step size of 1 byte was 
used. To populate our blacklist database, we began at the start of each blacklist file, hashed 
a fragment, then moved over one byte, hashed that piece of data, etc. This guaranteed that 
all fragments would be present in our database. The hashdb utility provides options to allow 
the user to control how fragments are ingested, as necessary. See Figure 4.1 for a diagram 
that describes our ingestion process. 
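
The ingestion strategy described above, hashing a fragment at every byte offset, can be sketched in plain Python. This is an illustration only: it uses an in-memory set in place of a hashdb database, and the function name `fragment_hashes` and the synthetic file are hypothetical.

```python
import hashlib
from typing import Set

BLOCK_SIZE = 1024  # fragment length in bytes
STEP_SIZE = 1      # advance one byte between fragments


def fragment_hashes(data: bytes, block_size: int = BLOCK_SIZE, step: int = STEP_SIZE) -> Set[str]:
    """MD5 of every block_size-byte window of data, taken every `step` bytes."""
    return {
        hashlib.md5(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data) - block_size + 1, step)
    }


# Synthetic 3000-byte "blacklisted file" built from 2-byte counters.
blacklisted_file = b"".join(i.to_bytes(2, "big") for i in range(1500))
database = fragment_hashes(blacklisted_file)

# A 1024-byte piece that starts at an arbitrary offset, as a packet payload might.
payload_piece = blacklisted_file[37:37 + BLOCK_SIZE]
piece_hash = hashlib.md5(payload_piece).hexdigest()

# For comparison: the same file ingested with non-overlapping 1024-byte steps.
coarse_database = fragment_hashes(blacklisted_file, step=BLOCK_SIZE)
```

With a step size of 1 the piece starting at offset 37 is present in the database, whereas the database built with 1024-byte steps does not contain it. The cost is that a file of n bytes contributes up to n - 1024 + 1 fragment hashes instead of roughly n / 1024.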

For this thesis, a byte alignment of 1 was necessary to find the target files in the traffic.
Because our data includes TCP and HTTP headers of arbitrary length, having a byte alignment
of 1 was the best option to make sure that we found the most hashes inside the database. If
the byte alignment were increased to 2, the bytes could be shifted just enough that we would
not find as many matching hashes. If the fragments were misaligned by even one byte, they
would appear as not found in the network traffic, even though we knew that the files existed
in the traffic. See Figure 4.2 to see how data blocks are hashed.

[Figure 4.1 diagram. Labels: File going into network traffic (http); HTTP header; h1, h2, h3; P1, P2, P3; other; 1024, 1024, 1004.]

Figure 4.1: Diagram of how the files were converted into hashes.

The file going into network traffic was split up into pieces, due to the MTU of the network.
We took the first 1024 bytes of payload from these Internet packets and took an MD5 hash
of them. We then looked at h1, h2 and h3, shaded in yellow, to see if they were in our
database. We split the file into pieces, and created MD5 hashes of the pieces of files. Each
hash covered 1024 bytes of data. These generated hashes were then placed into a hash
database, where the hashes of packet payload could be compared.

Figure 4.2: Figure 4 from hashdb user manual, showing how data blocks are
hashed. Source [30].
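
The matching step, hashing the first 1024 bytes of each payload and looking the hash up in a database built at every byte offset, can be simulated end to end in Python. This is a sketch under stated assumptions rather than our actual tooling: it uses an in-memory set instead of hashdb, slices a byte string instead of capturing packets, and every name and the synthetic file are hypothetical.

```python
import hashlib
from typing import Iterable, List, Optional, Set

BLOCK_SIZE = 1024    # bytes hashed from the front of each payload
PAYLOAD_SIZE = 1448  # TCP payload size assumed for the simulation


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def build_database(target: bytes) -> Set[str]:
    """Hash every 1024-byte window of the target file (step size of 1 byte)."""
    return {md5_hex(target[i:i + BLOCK_SIZE]) for i in range(len(target) - BLOCK_SIZE + 1)}


def payload_hash(payload: bytes) -> Optional[str]:
    """MD5 of the first 1024 bytes of a payload, or None if it is too short."""
    if len(payload) < BLOCK_SIZE:
        return None
    return md5_hex(payload[:BLOCK_SIZE])


def count_matches(payloads: Iterable[bytes], database: Set[str]) -> int:
    """Count payloads whose leading fragment hash is present in the database."""
    return sum(1 for p in payloads if payload_hash(p) in database)


def packetize(stream: bytes) -> List[bytes]:
    """Split an HTTP response into consecutive payload-sized pieces."""
    return [stream[i:i + PAYLOAD_SIZE] for i in range(0, len(stream), PAYLOAD_SIZE)]


# Synthetic 5000-byte "target file" made of 2-byte counters.
target_file = b"".join(i.to_bytes(2, "big") for i in range(2500))
database = build_database(target_file)

header = b"HTTP/1.1 200 OK\r\nContent-Length: 5000\r\n\r\n"
payloads = packetize(header + target_file)
matches = count_matches(payloads, database)
```

In this synthetic run, two of the four payloads match. The first payload does not match because its first 1024 bytes begin with the HTTP header, and the final payload is shorter than 1024 bytes and is skipped.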


4.4 Choosing a Block Size 

We found when trying to download files that most packets had a TCP payload of 1448
bytes. This was because the overall size of the packet was 1514 bytes. In addition to the
payload, we also took the Ethernet header, IP header, and TCP header into account. In our
experiment, the Ethernet, IP, and TCP headers typically added up to 66 bytes. There were
instances where the header size was 74 bytes. This mostly happened in ACK packets. We
chose our block size for our blacklist based on the typical payload size. Initially, the block
size was 1448 bytes, the size of most packets’ payloads that we were looking for.
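
The relationship between the packet size, the header sizes, and the payload size reported above can be checked with simple arithmetic. In the Python sketch below, the split of the 66 header bytes into 14, 20, and 32 is an assumption based on common Ethernet, IPv4, and TCP header sizes (a 20-byte TCP header plus 12 bytes of options); the text reports only the 66-byte total.

```python
PACKET_SIZE = 1514     # overall packet size observed, in bytes
ETHERNET_HEADER = 14   # standard Ethernet II header
IPV4_HEADER = 20       # IPv4 header without options
TCP_HEADER = 32        # assumed: 20-byte base header plus 12 bytes of options

headers = ETHERNET_HEADER + IPV4_HEADER + TCP_HEADER
tcp_payload = PACKET_SIZE - headers

# Payload bytes left over after the first 1024-byte block of a full-size payload.
remainder_after_1024 = tcp_payload - 1024
```

A 1024-byte block therefore fits inside a typical payload, with several hundred bytes to spare.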

We ran multiple experiments in which we hashed along 1024, 1200, and 1448 byte
boundaries. When hashdb took a 1448 byte hash at every byte inside the directory, it created
6,678,056 hashes in the database. With a 1200 byte hash at every byte, it created 6,695,346
hashes, and with a 1024 byte hash at every byte, it created 6,712,048 hashes.

When hashing, we did not have to match on 1448 byte boundaries. A smaller boundary
allows our experiment to work on different networks that might have smaller payload sizes
than the network that we were using. When hashing along boundaries, we could not hash the
first 1024 bytes of data, then simply move on to the next 1024 bytes, and so on, because the
data could not be aligned properly. Due to the variation of the HTTP header and MTU, the
hash of the first payload would not be able to match any hash that we created.
Table 4.3 shows how many hashes were created in each database.


Experiment   Files   Block Size   Hashes
1            50      1448         6,678,056
1            50      1200         6,695,346
1            50      1024         6,712,048
2            990     1024         412,551,239

Table 4.3: Hashes in each hashdb database created. 

Table 4.3 lists, for each experiment, the number of files in the database, the size of the 
blocks that were hashed, and the number of hashes in the database. 
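The effect of hashing at every byte offset rather than only on block boundaries can be illustrated with a minimal sketch. It uses Python's hashlib in place of hashdb, and the helper name `block_hashes` and the synthetic 2048-byte "file" are assumptions made only for illustration.

```python
import hashlib


def block_hashes(data: bytes, block_size: int, step: int = 1) -> set[str]:
    """Return MD5 hex digests of every full block taken at `step`-byte intervals."""
    return {
        hashlib.md5(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data) - block_size + 1, step)
    }


# Synthetic stand-in for a target file: 2048 bytes of non-repeating data.
file_data = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))

# A payload whose file data begins 300 bytes into the file, as happens when a
# header of arbitrary length shifts where each packet's data starts.
payload_hash = hashlib.md5(file_data[300:300 + 1024]).hexdigest()

every_byte = block_hashes(file_data, 1024, step=1)       # one hash per byte offset
aligned_only = block_hashes(file_data, 1024, step=1024)  # block boundaries only

print(payload_hash in every_byte)    # found: the offset-300 block was hashed
print(payload_hash in aligned_only)  # not found: only offsets 0 and 1024 exist
```

The every-offset database is far larger (1025 hashes here against 2), which is the storage cost paid for tolerating misalignment.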


4.5 Building the Target Databases 

We created the databases using a combination of hashdb and bulk_extractor. As mentioned 
earlier (see Section 2.4.1), bulk_extractor is capable of processing large amounts of data 
at once. We can take advantage of this by using hashdb as a scanner within a 
bulk_extractor instance. Using the hashdb scanner, bulk_extractor can add hashes of the 
data from its run to a specified hashdb database. 

Multiple options are necessary to correctly configure bulk_extractor. The “-E” option 
followed by a scanner tells bulk_extractor to use no scanner other than the one 
provided. The “-S” option allows the user to pass configuration options 
to individual scanners. Figure 4.3 shows an example command line invocation of 
bulk_extractor. The “-S hashdb_mode=import” option is used to import the data as hashes. The 
“-S hashdb_byte_alignment=1” and “-S hashdb_step_size=1” options set the 
byte_alignment and step_size. The step_size is the interval at which data is hashed. The 
byte_alignment determines what byte size the data is hashed along. When the step_size is 
1, the byte_alignment also must be 1. The byte_alignment value 
has to be divisible by the step_size. In many instances, the byte_alignment is large to reduce 
the space required to store the hashes in the database. The “-S hashdb_block_size=1024” 
option determines how many bytes are used to create each MD5 hash, in this example 1,024 
bytes. The “-R <input directory>” option provides bulk_extractor with a directory of 
data to process. The “-o <output directory>” option gives the output directory 
for the hashdb database. 

bulk_extractor -E hashdb -S hashdb_mode=import -S hashdb_byte_alignment=1 
-S hashdb_step_size=1 -S hashdb_block_size=1024 -R <input directory> 
-o <output directory> 


Figure 4.3: Instance of bulk_extractor run. 

The time required to construct the database depends on the number of hashes inserted and 
the available resources. Using these parameters with bulk_extractor, we built a database 
of the fifty files from Sample 1. This is the control hashdb database. We 
used it later in the experiment to know which hashes and files existed in the 
database. 


In order to calculate the number of hashes h that we expect our database to contain, we use 
the following equation, where n is the number of files, x is the average file size, k is the 
block size, and y is the step size. 

$$h = n \left( \frac{x - k}{y} + 1 \right)$$
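As a worked illustration of this estimate, the sketch below evaluates h = n((x - k)/y + 1) for given parameters. The function name and the choice to use integer division and to return zero for files shorter than one block are assumptions, and the input values are illustrative rather than taken from the experiments. The actual database can hold fewer hashes, for example when identical blocks occur more than once.

```python
def expected_hashes(n: int, x: int, k: int, y: int = 1) -> int:
    """Expected hash count h = n * ((x - k) / y + 1), using integer division.

    n: number of files, x: average file size in bytes,
    k: block size in bytes, y: step size in bytes.
    """
    if x < k:
        return 0  # a file shorter than one block yields no full-block hash
    return n * ((x - k) // y + 1)


# One 2048-byte file hashed in 1024-byte blocks at every byte offset.
print(expected_hashes(n=1, x=2048, k=1024, y=1))  # 1025

# Fifty files averaging 100,000 bytes (illustrative sizes only).
print(expected_hashes(n=50, x=100_000, k=1024, y=1))
```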


We then built a database containing 990 files, initially using the aforementioned options. 
However, this database took a long time to build, and bulk_extractor timed out because 
hashing a block at every byte offset took more time than bulk_extractor allowed. The size 
of the files being processed also contributed to the timeouts. 
By default, a thread in bulk_extractor cannot run for more than 60 minutes, and some of the 
threads exceeded this limit due to the amount of data they had to process. 
To resolve this problem, we divided the 990 files into 12 folders so that 
bulk_extractor did not have to process as much data in a single 
instance. One folder held only one file, which was so large that bulk_extractor 
could process it only when it was placed in a directory of its own. 
Building the database for each of the 12 folders took a little 
over an hour on average. By 
opening the hashdb database only once in the program, we brought down the program run time 
by about 75%. 


A master database containing the hashes from the 990 file directory had to be created. With 
the 12 smaller databases already built, the next step was to combine them into one master 
database. hashdb has a feature that allows you to copy the hashes from one 
database into another. We used it 12 times, once for each of the 12 directories that 
were hashed. When all of the hashes were combined, 
the master database contained about 415 million hashes. 


hashdb add <source hashdb directory> <destination hashdb directory> 
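Repeating this command for each of the 12 databases can be scripted. The sketch below only builds the argument lists for the `hashdb add` command shown above and does not run them. The directory names are hypothetical placeholders, and each list could be passed to `subprocess.run` once the paths are real.

```python
def merge_commands(source_dbs: list[str], master_db: str) -> list[list[str]]:
    """Build one `hashdb add <source> <destination>` argument list per source."""
    return [["hashdb", "add", source, master_db] for source in source_dbs]


# Hypothetical names for the 12 per-folder databases and the master database.
sources = [f"folder_{i:02d}.hdb" for i in range(1, 13)]

for argv in merge_commands(sources, "master.hdb"):
    print(" ".join(argv))
```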


In order to determine that the master database was working properly, we ran test traffic sets 
through the master database to ensure that we found all of the files we expected to 
find. We ran Experiment 1’s data through the master database as a test and 
successfully found all fragments. 


4.6 Capturing Traffic 

We started with a server that hosted files from Govdocs (mentioned in Section 2.3.1). The 
next step was to collect the data. We used a Python program to select 100 sequential files 
from Govdocs. This program sets up a web connection to download each of the 
files and records the filename, HTTP code, IPv4 address of the web server, source port, 
destination port, and size of the file downloaded. While this program was running, 
an instance of tcpdump captured the resulting packets and wrote them to a pcap file. 
This first pcap file was used in Experiment 1. 

After we collected the 100 file capture, we collected the data for Experiment 2. 
For our second experiment, we wanted data representative 
of a network that has many packets flowing through it. The second experiment focused 
on downloading two 991 file directories and storing their content in pcap files. Figure 4.4 
shows how the files were downloaded onto the Mac laptop. 



[Diagram not reproduced; labeled component: Network Share (file storage)] 

Figure 4.4: Diagram of server setup, how files were downloaded. 


We created a Python program to analyze each packet, matching packets with an individual 
file ID from Govdocs to produce a list of packets and the file IDs that are associated with each 
packet. The program requires three inputs: the pcap file, the file IDs for the different 
files, and the file generated during the download process that records which flow 
tuple each file corresponds to. The program looks at each flow tuple and determines which file ID 
belongs to each packet in that flow. 
The output is a list of packets and the file IDs that correspond to them. This 
file is the ground truth: it tells us exactly which file is associated with each packet. 
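The labeling step can be sketched as a lookup from flow tuple to file ID. This is a simplified illustration, not the program used in the experiments: the `Packet` fields, the four-element flow tuple, and the sample addresses and file IDs are all assumptions.

```python
from typing import NamedTuple, Optional


class Packet(NamedTuple):
    number: int
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def label_packets(
    packets: list[Packet],
    flow_to_file: dict[tuple[str, int, str, int], str],
) -> list[tuple[int, Optional[str]]]:
    """Pair each packet number with the file ID of its flow, or None for noise."""
    labels = []
    for p in packets:
        flow = (p.src_ip, p.src_port, p.dst_ip, p.dst_port)
        labels.append((p.number, flow_to_file.get(flow)))
    return labels


# Hypothetical download log: one server-to-client flow carried file 002021.
flow_to_file = {("10.0.0.5", 80, "10.0.0.9", 50123): "002021"}
packets = [
    Packet(1, "10.0.0.5", 80, "10.0.0.9", 50123),
    Packet(2, "10.0.0.5", 80, "10.0.0.9", 50999),  # flow not in the log
]
print(label_packets(packets, flow_to_file))
```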

4.7 Analyzing Traffic for Target Data 

Figure 4.1 summarizes the process described so far. The files travel into network 
traffic via HTTP. Each packet’s payload is then 
extracted, and the first 1,024 bytes of that payload are hashed. Separately, the database 
of blacklisted files contains hashes of file fragments. 
The database is queried to see whether the payload’s hash exists in it, 
and the result indicates whether the hash of the packet payload was found 
or not. 
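The lookup just described can be sketched in a few lines. Here a Python set stands in for the hashdb database, which is an assumption made to keep the example self-contained, and the function name and sample data are hypothetical.

```python
import hashlib

BLOCK_SIZE = 1024  # bytes of each payload that are hashed


def scan_payloads(
    payloads: list[tuple[int, bytes]],
    target_hashes: set[str],
    block_size: int = BLOCK_SIZE,
) -> list[tuple[int, int]]:
    """Return (packet number, 1 or 0) depending on whether the payload matched."""
    results = []
    for number, payload in payloads:
        if len(payload) < block_size:
            # A payload shorter than one block cannot equal any full-block hash.
            results.append((number, 0))
            continue
        digest = hashlib.md5(payload[:block_size]).hexdigest()
        results.append((number, 1 if digest in target_hashes else 0))
    return results


# Stand-in target database holding the hash of a single 1024-byte fragment.
fragment = bytes(range(256)) * 4
targets = {hashlib.md5(fragment).hexdigest()}

traffic = [
    (1, fragment + b"tail of a 1448-byte payload"),  # begins with the fragment
    (2, b"x" * 1448),                                # unrelated data
    (3, b"short"),                                   # too short to match
]
print(scan_payloads(traffic, targets))
```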


For each of the pcaps that we captured, we ran the program that we created to hash the 
payload of each packet. Each hash was then looked up 
in our blacklist database. A match might indicate that target data was downloaded. 
If we did not find files that we knew should be in the traffic, we investigated 
why those files were not found. Our program used a “scanning” function that is 
included with hashdb’s API. This opened a hashdb database, and we passed in the hash 
of the packet’s payload to see if it was in the database. If the payload’s hash existed, a “1” 
was output, indicating that a matching hash was found, along with the number of the packet in 
which the payload’s hash was found. If the payload’s hash did not match anything 
in our database, the packet was marked with a “0” for false. We also output the 
JSON string describing the hash that was found. This includes the file directory, block size, 
entropy, and other facts about the file and its hash, so that we can identify which file was 
found. Table 4.4 lists the files downloaded, how many target packets we 
are looking for, and how many packets are in the overall traffic. 


Experiment   Capture   Files   Target Packets   Total Packets
1            1         100     4575             9635
2            1         200     31284            54822
2            2         200     33691            79873
2            3         200     22890            158451
2            4         200     41956            72481
2            5         199     21327            36981
2            6         200     24626            52186
2            7         200     30963            52495
2            8         200     26342            73342
2            9         200     25502            44323
2            10        183     20576            47303

Table 4.4: The statistics of each experiment and file capture. 

Each of the captures we completed is associated with an experiment. We noted how many 
files were downloaded in each capture, as well as the number of target packets and total 
packets in each capture. 


If the hashes were found, the method we used worked. If the hashes 
were not found, we needed to investigate why 
and determine what parameters could be changed in order to find the hashes of the payloads inside the 
hashdb database. 


4.8 Scoring Our Results 

We look for four outcomes when we examine the traffic: true positives, 
false positives, true negatives, and false negatives. A true positive is a hash that finds 
a match when we know that target data is in the packet. 
A false positive is a hash that finds a match even though we know the packet does not 
contain the content we are looking for. A true 
negative is a hash that does not find a match when the payload is not content we are 
looking for. A false negative is a hash that does not find a match even though we know 
the packet carries target data. These categories are summarized as 
follows: 

• Positive: A hash of a packet payload appears in our target database. 

• Negative: A hash of a packet payload does not appear in our target database. 

• True positive: A hash of a packet payload appears in our target database AND the 
packet comes from transfer of a target file. 

• True negative: A hash of a packet payload does not appear in our target database 
AND the packet does not come from transfer of a target file. 

• False positive: A hash of a packet payload appears in our target database BUT the 
packet does not come from transfer of a target file. 

• False negative: A hash of a packet payload does not appear in our target database 
BUT the packet does come from the transfer of a target file. 

The overall goal is to produce only true positives and true negatives. If there were a high 
number of false positives, it required some fine tuning to ensure that we were looking to 
find the target data, not the noise data. If there were a high number of false negatives, that 
meant that there was a problem with the program that we created. 

To determine the numbers of true positives, true negatives, false positives, and false negatives, 
the ground truth file and the output from the traffic parser program were fed into 
a Python scoring program. This program reads in the two files and determines 
whether or not the packets match between the traffic and the ground truth. It 
then counts the positives and negatives and displays them on the screen. A packet is 
considered a true positive when its payload hash matches a hash in the database and the 
ground truth associates a file ID with the same packet. 
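The scoring logic can be sketched as follows. The dictionary inputs and the function name are assumptions made for illustration. The scoring program described above reads the same information from the ground truth and traffic parser output files.

```python
from typing import Optional


def score(
    ground_truth: dict[int, Optional[str]],
    matches: dict[int, int],
) -> dict[str, int]:
    """Count outcomes per packet.

    ground_truth maps packet number -> file ID (None for non-target packets).
    matches maps packet number -> 1 if the payload hash was in the database, else 0.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for number, matched in matches.items():
        is_target = ground_truth.get(number) is not None
        if matched and is_target:
            counts["TP"] += 1
        elif matched and not is_target:
            counts["FP"] += 1
        elif not matched and is_target:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Tiny hypothetical example covering all four outcomes.
truth = {1: "002021", 2: None, 3: "002021", 4: None}
found = {1: 1, 2: 1, 3: 0, 4: 0}
print(score(truth, found))
```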

Once the tests were complete, we focused on analyzing problems and refining our method 
to bring down the false positives and false negatives that we found within our traffic. 
CHAPTER 5: 
Results and Analysis 


In our first experiment, we find that the smallest block size tested (1024 bytes) gave the 
best results. In our second experiment, we use this block size and perform 10-fold cross 
validation on Govdocs Sample 2 to determine the effectiveness of our method. 

We used folds, or parts of the experiments, to determine the results using the programs and 
methods described in Chapter 4. The results for each experiment showed how many true 
positives, true negatives, false positives, and false negatives were associated with each fold. 


5.1 Experiment One: Testing the Impact of Hash Block 
Size on Precision and Recall with Govdocs Sample 1 

After examining the results of our experiments with 1448, 1200, and 1024 bytes, we 
determined that 1024 byte fragments produced the best outcome. In Table 5.1, the precision 
and recall are noted for each result with different block size. 

As indicated in Figure 5.1, the first evaluation of our initial experiment with hashes of 1,448 
byte segments yielded 4479 true positives, 4717 true negatives, 25 false positives, and 96 
false negatives. 
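For reference, precision and recall follow from these counts using the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The short sketch below applies them to the 1,448 byte counts quoted above. The function names are illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of packets flagged as target that really were target packets."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of real target packets that were flagged."""
    return tp / (tp + fn)


# Counts reported for the 1,448 byte evaluation.
tp, fp, fn = 4479, 25, 96
print(f"precision = {precision(tp, fp):.1%}")  # 99.4%
print(f"recall    = {recall(tp, fn):.1%}")     # 97.9%
```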

The false positives are attributed to two PDF files that have similar content. The target 
PDF file was a preliminary report, 002021.pdf, and the PDF file that generated false 
positives was a final report, 002086.pdf. Figures 5.2 and 5.3 show the content of the two 
PDFs. The two files are 754KB and 1.2MB in size, respectively. The other file that 
generated false positives was 002096.pdf. This file is not similar to the other two. There 
were 23 false positives from 002086.pdf and 2 from 002096.pdf. 

Block Size   Precision   Recall
1448         99.4%       97.9%
1200         99.4%       98.0%
1024         99.4%       98.2%

Table 5.1: Precision and Recall for each of three potential outcomes. 

                    Prediction outcome
                    p        n        total
actual    p'        4479     96       4575
value     n'        25       4717     4742
          total     4504     4813     9317

Figure 5.1: Confusion matrix, hashes generated from 1448 byte segments 

After further investigation, we determined that the false negatives could be explained by 
two features of the file transfer mechanism we employed. First, HTTP header data inside 
the first packet of content for each transferred file prevented hashes of that packet from 
matching hashes in our target database. Because the HTTP header is of arbitrary length, 
this is something that we had to account for when looking for the hashes of the first packets 
in a stream. All packets after the HTTP header were found the majority of the time. Second, 
most of the files did not end exactly on 1448 byte alignment, meaning that many of the final 
packets in a flow did not match. There were 50 header packets and 46 final packets the 
program could not find. 

Even though the program failed to find 46 final packets, that still left 4 final 
packets unaccounted for, since all 50 final payloads were smaller than 1448 bytes, as 
indicated in Table 5.2. Two files, 002028.html and 002007.html, were smaller than 
1448 bytes, so the header packet and final packet in their streams were the same packet. 
The other two final packets, 
from files 002008.html and 002047.html, did not appear either. Both files 
had a final payload size of less than 100 bytes, which indicates that our Python script had an issue 
identifying small packets. 

As indicated in Figure 5.4, the second evaluation of the first experiment with hashes of 
1,200 byte segments yielded 4483 true positives, 4716 true negatives, 26 false positives, 
and 92 false negatives. 
[Page images not reproduced. Both first pages carry the title “Diamond Sawblades and Parts 
Thereof From China and Korea”; one is marked “Investigation Nos. 731-TA-1092 and 1093 
(Preliminary)” and the other “Investigation Nos. 731-TA-1092-1093 (Final)”.] 

Figure 5.2: First page of 002021.pdf 
Figure 5.3: First page of 002086.pdf 


Block Size                        1448 Bytes   1200 Bytes   1024 Bytes
# of Fragments below Block Size   50           46           37

Table 5.2: Fragments below block size. 

This table demonstrates the final fragment sizes and their relation to block sizes used. 

                    Prediction outcome
                    p        n        total
actual    p'        4483     92       4575
value     n'        26       4716     4742
          total     4509     4808     9317

Figure 5.4: Confusion matrix, hashes generated from 1200 byte segments 


Compared with the 1,448 byte segment hashes, the 1,200 byte segment hashes provided us 
with four fewer false negatives, one more false positive, and one fewer true negative. Packet 
numbers 85, 2880, 2918 and 2944 showed as true positives instead of false negatives. This 
is because these packets were the final ones in their respective streams. Packet 7592 showed 
as an additional false positive instead of a true negative. Packet 7592 has a payload size 
of 1448, but the hash of the first 1,200 bytes of that payload was recognized as belonging 
to 002021.pdf. There were 24 false positives from 002086.pdf and 2 false positives from 
002096.pdf: one more than before for 002086.pdf and the 
same number for 002096.pdf. 

As indicated in Figure 5.5, the final evaluation of the first experiment with hashes of 1,024 
byte segments yielded 4492 true positives, 4715 true negatives, 27 false positives, and 83 
false negatives. 


                    Prediction outcome
                    p        n        total
actual    p'        4492     83       4575
value     n'        27       4715     4742
          total     4519     4798     9317

Figure 5.5: Confusion matrix, hashes generated from 1024 byte segments 


Compared with the 1,200 byte segment hashes, the 1,024 byte segment hashes provided us 
with nine fewer false negatives, one more false positive, and one fewer true negative. Packet 
numbers 76, 211, 273, 344, 2033, 2040, 2725, 2871 and 3886 showed as true positives 
instead of false negatives, and packet 8391 showed as an additional false positive instead 
of a true negative. Packet 8391 has a payload size of 1448, but the hash of the first 1,024 
bytes of that payload was recognized as belonging to 002021.pdf. There were 25 false 
positives from 002086.pdf and 2 false positives from 002096.pdf. There was once again 
an additional false positive that belonged to 002086.pdf, and the number for 002096.pdf 
remained the same. The same two false positives for 002096.pdf were found in each of 
the three evaluations. This is probably an instance where the data contained in the file 
matched content in another one of the files that was in the database. With regard to 
the false positives of 002086.pdf, 002021.pdf and 002086.pdf are so close to each other that it 
makes sense that more hashes continue to be found. As mentioned earlier, 002021.pdf was 
a preliminary report, while 002086.pdf was a final report, so they both contain a lot of the 
same information and data. 

After running the different evaluations of the first experiment, we were able to bring down 
our false negatives and increase our true positives. The number of false negatives dropped 
from 96 to 83 because 13 packets at the end of file transfer flows had payloads that were 
greater than 1,024 bytes, but less than 1,448 bytes. We also noticed a small increase in the 
false positives. This is probably because the same quantity of matching data was split into 
slightly more fragments when the fragment size decreased. All of the false positives came 
from 002021.pdf. 

In addition, we found that our program consistently had difficulty finding 4 
packets: our counts were off by 4 false negatives in every evaluation. 
There were two factors: small files and small packet payloads. Two files 
had an overall size of less than 1448 bytes: 002007.html and 002028.html. In 
those two packet streams, the header packet and the final packet were the same packet. For small packet 
payloads, there was an issue with our program. It had difficulty identifying true negatives 
on packets smaller than 100 bytes in size: 002008.html had a final payload size of 36 bytes 
and 002047.html had a final payload size of 19 bytes. 


5.2 Experiment 2: 10-fold Validation with Govdocs Sample 2 

Results from evaluations of the first experiment led us to conclude that 1,024 byte segments 
were a reasonable choice of block size. When running these tests, we found that expanding 
the size of the master database had minimal impact on the runtime of our analysis. Though 
further testing is required to measure performance, this preliminary observation was 
encouraging. Table 5.3 shows the number of true and false positives, the number of 
true and false negatives, the precision, and the recall for each of the folds. Figure 5.6 shows the Govdocs 
Sample 2 confusion matrix. 

The precision averaged 98.3% and the recall averaged 99.2% throughout the 10 folds. 

The majority of our false positives were caused by PDFs, HTML webpages, Word 
documents, PowerPoint presentations, Excel spreadsheets, and PostScript files. PDF files contain 
similar data in certain parts of the file, because PDF is a standardized file format. Microsoft 
Office documents normally have the same content at the start of each file. PostScript files are 
similar to PDF files. About 100 of the target files were PDF files. 
In the false positives, about 280 noise files incorrectly matched 
data in blacklisted files. False positive totals by file extension are found in Table 5.4. 

Fold   True Positives   True Negatives   False Positives   False Negatives   Total Packets   Precision   Recall
1      31020            22568            333               264               54822           99.0%       99.2%
2      33438            44999            542               253               79873           98.4%       99.2%
3      22672            134473           437               218               158451          98.1%       99.0%
4      41745            29771            473               211               72481           98.9%       99.5%
5      21112            14822            186               215               36981           99.1%       99.0%
6      24416            26647            266               210               52186           98.9%       99.1%
7      30676            20011            890               287               52495           97.2%       99.1%
8      26221            45309            959               211               73342           96.5%       99.2%
9      25271            17759            416               231               44323           98.4%       99.1%
10     20386            25950            195               190               47303           99.1%       99.1%

Table 5.3: The packet results for each fold in Experiment 2. 

The precision and recall are fairly consistent for the 10 folds. The only two folds with precision lower 
than 98% had a high number of false positives. The recall did not go below 99%. Fold 5 
was smaller than the others. Its pcap was 20MB smaller than the next closest 
capture, indicating that the download did not consist of many large files. Fold 3 had a high 
number of packets because a 133MB Excel spreadsheet was downloaded. 

                    Prediction outcome
                    p         n         total
actual    p'        276957    2290      279247
value     n'        4697      382309    387006
          total     281654    384599    666253

Figure 5.6: Confusion matrix, hashes generated 1024 bytes from all ten 
capture files 
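As a cross-check on the averages reported for the 10 folds, the per-fold counts can be pooled and the two metrics recomputed. The sketch below does this with the true positive, false positive, and false negative counts from the fold results. Pooling the counts is one reasonable way to summarize the folds and is an assumption here, not necessarily how the reported averages were computed.

```python
# (true positives, false positives, false negatives) for folds 1 to 10.
folds = [
    (31020, 333, 264), (33438, 542, 253), (22672, 437, 218),
    (41745, 473, 211), (21112, 186, 215), (24416, 266, 210),
    (30676, 890, 287), (26221, 959, 211), (25271, 416, 231),
    (20386, 195, 190),
]

tp = sum(f[0] for f in folds)
fp = sum(f[1] for f in folds)
fn = sum(f[2] for f in folds)

pooled_precision = tp / (tp + fp)
pooled_recall = tp / (tp + fn)

print(tp, fp, fn)                                   # 276957 4697 2290
print(f"pooled precision = {pooled_precision:.1%}")  # 98.3%
print(f"pooled recall    = {pooled_recall:.1%}")     # 99.2%
```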

Many fragments of PDF files in the 002 govdocs directory matched 
fragments in the 013 govdocs directory because multiple PDF files had the same data. There 
were multiple cases in which one PDF file was a preliminary report and another was a 
final report similar to the preliminary report, but not quite the same. Some Microsoft Office 
documents also contained similar content. Because these files share identical 1,024 byte runs, 
they confused our matching algorithm and produced false positives. 

Extension   False Positives
PDF         2,361
PS          641
DOC         537
PPT         378
XLS         255
HTML        252
CSV         190
JPG         65
PPS         13
TEXT        3
XML         2

Table 5.4: Number of false positives per file extension. 

Out of all the false negatives, 1,633 packets were pieces of files immediately following 
HTTP headers, the last packet in a stream, or the header and final packet in the stream, as 
some files were small and did not have data in more than one packet. The HTTP header 
caused false negatives because the packet content had to exactly match 1,024 byte hashes 
in files inside the directory. False negatives by file extension totals are found in Table 5.5. 

There were 657 false negatives that did not correspond to the first or last 
packets in a stream. The majority of these false negatives were from Word documents, 
PowerPoint presentations, and Excel spreadsheets. In Microsoft Office 
documents, the first 1024 bytes of most of the false negative payloads were filled with a 
single byte value, such as 0x00 or 0xff. After hashing blocks of 1024 0x00 bytes and 1024 
0xff bytes, it appears that 
those hashes were not included in the hashdb database that was built. In addition, there 
were instances of an HTML file, a PDF document, a PostScript document, and two JPG 
images where there were more than two false negatives. The HTML file had a duplicate 
packet that was sent, the PDF document had the same 0x20 byte for its first 1024 bytes in 
the hash, the PostScript document had the same 0x46 byte for its first 1024 bytes for 21 
packets, one JPG had the same 0x00 byte for two false negative packets, and the second 
JPG had a duplicate packet that was sent. 

Extension   False Negatives
HTML        472
DOC         461
PDF         412
PPT         305
TXT         271
JPG         151
XLS         99
PS          35
XML         19
GIF         17
CSV         10
UNK         10
GZ          6
LOG         5
TEXT        5
DBASE3      4
TEX         3
JAVA        2
WP          2
GES         1

Table 5.5: False negatives by file extension. 
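Payloads of this kind can be recognized before lookup, so that they are reported separately rather than counted as ordinary misses. The sketch below flags a block made up of a single repeated byte value. The helper names are hypothetical, and the premise that such blocks are absent from the database is taken from the observation above rather than from hashdb's documentation.

```python
def is_uniform_block(block: bytes) -> bool:
    """True when every byte in a non-empty block has the same value."""
    return len(block) > 0 and len(set(block)) == 1


def classify_payload(payload: bytes, block_size: int = 1024) -> str:
    """Label a payload by what its first block looks like."""
    if len(payload) < block_size:
        return "too short"
    if is_uniform_block(payload[:block_size]):
        return "uniform"  # e.g. 1024 bytes of 0x00, 0xff, or 0x20
    return "hashable"


print(classify_payload(b"\x00" * 1448))          # uniform
print(classify_payload(b"\xff" * 1024 + b"ab"))  # uniform
print(classify_payload(bytes(range(256)) * 5))   # hashable
print(classify_payload(b"\x20" * 100))           # too short
```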

There were many false negatives with PDF and HTML files, but those were the first and last 
packets per traffic stream. There were also a large amount of TXT and JPG files that caused 
multiple false negatives with the first and last packets in their respective traffic streams. 






CHAPTER 6: 

Conclusion and Future Work 


This chapter discusses our conclusions and proposed future work. 


6.1 Conclusion 

Our goal was to develop a technique to find target data in network traffic without rebuilding 
a network stream by hashing the payload of packets. We tested different window sizes to 
determine the size that allows us to find the most packets that contained target data. The 
best window size that we found was 1,024 bytes, because it 
maximized the number of true positives while also reducing the number of false negatives. 

We were able to identify packets that contained our target data without reconstructing traffic 
streams, and to do so accurately enough to help determine whether data was exfiltrated. 
Our average precision was 98.3% and 
our average recall was 99.2%. We were able to develop a program that would generate the 
results within a few minutes. We were also able to put over 415 million hashes into one 
database. 

We believe this is a feasible approach to finding pieces of files in network traffic over HTTP. 
We had a high percentage of finding pieces of files that we were hoping to find. This method 
would not work over HTTPS, unless there was an ability to break SSL. However, there were 
some issues, such as finding a hash with the same 1,024 bytes throughout the length of 
payload. When the same byte was present for the first 1,024 bytes of payload, hashdb did 
not take a hash of that data. That caused a large amount of false negatives. 

We were able to work towards our goal of creating a system that would look at a packet’s 
individual payload, take a hash of that payload, then compare the results to a database 
that contains the hashes of blacklisted files. The program was a prototype implemented in 
Python and we are leaving the development and testing of a speed-optimized version for 
future work. 

While testing bulk_extractor, we found a bug that prevented users from specifying the 
amount of time to give a thread before it timed out. In addition, there was also the testing 
of new versions of bulk_extractor and hashdb. These programs, which were critical to the 
completion of this thesis, were constantly used and tested to confirm they were working properly. 


6.2 Future Work 

Our experiment demonstrates that a simple approach of hashing the beginning of the 
payload allows us to reliably determine whether packet payloads were part of our target 
data. However, to show that this method improves over the state of the art, we need to test 
performance to determine how much throughput this approach can handle. We did not run 
this system in real-time, and performed the majority of our tests post-mortem, so collecting 
these metrics remains for future work. 

Furthermore, more testing needs to be done to determine how large the target set can be 
given the available memory resources. We were only able to sample 1,982 files of the one 
million file govdocs corpus. In the future, experiments could be done with more blacklist 
files to test the limits of our approach. The hashdb database is capable of handling many 
more hashes than what we have processed. However, more work needs to be done to make 
it possible to ingest larger files. Due to bulk_extractor’s issue with only having a thread run 
for 60 minutes, some of the data might not be able to be processed in time. Splitting up 
the data into chunks and hashing it was the best way to complete this task. Another option 
would be to update to the latest version of bulk_extractor. 

To complete the implementation of an approximate matching scheme, we would need to 
determine the probability that, given some number of packet matches against the payload, 
a targeted file could have been or is in the process of being exfiltrated off of the network. 
This could be determined in part from the results of our experiment, but a more extensive 
test using real traffic would make these results more representative. For this reason, our 
approach should be tested on a more realistic dataset. That would hopefully prove that this 
system would be able to work in a larger environment than the one we tested in. 

Once we understand the probability that a packet match correlates to the presence of a target 
file, further experiments should be run in which we test different thresholds for number of 
matches or the proportion of the target file matched. Precision and recall should then be 
evaluated again for different threshold values. These results, in addition to the throughput 
metrics, should then be compared to those of "state of the art" data exfiltration prevention 
systems. This will help determine whether or not our system would be able to compete 
against other systems like it. 

In our experiment, packets that contained the HTTP header always registered as false 
negatives. One way to fix this problem is to remove the HTTP header before hashing. The HTTP header 
is of arbitrary length, so this would involve additional parsing. That would help increase 
the number of true positives. 
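A minimal sketch of the extra parsing is shown below. It relies on the fact that an HTTP/1.1 header section ends with an empty line (CRLF CRLF). The function names are hypothetical, and the sketch assumes the entire header fits inside the first packet of the response.

```python
import hashlib

HEADER_END = b"\r\n\r\n"  # blank line that terminates an HTTP/1.1 header section


def strip_http_header(payload: bytes) -> bytes:
    """Return the bytes after the HTTP response header, or the payload unchanged."""
    if not payload.startswith(b"HTTP/"):
        return payload  # not the first packet of a response
    end = payload.find(HEADER_END)
    if end == -1:
        return payload  # header continues past this packet; leave untouched
    return payload[end + len(HEADER_END):]


def first_block_hash(payload: bytes, block_size: int = 1024):
    """MD5 of the first block of file data, or None if too little data remains."""
    body = strip_http_header(payload)
    if len(body) < block_size:
        return None
    return hashlib.md5(body[:block_size]).hexdigest()


file_start = bytes(range(256)) * 4  # stand-in for the first 1024 bytes of a file
first_packet = b"HTTP/1.1 200 OK\r\nContent-Length: 5000\r\n\r\n" + file_start
print(first_block_hash(first_packet) == hashlib.md5(file_start).hexdigest())
```

When the header is long, fewer than 1,024 bytes of file data may remain in the first packet, in which case this sketch returns None and the packet still cannot be matched.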

If we were able to fix the alignment problem, we could reduce space needed to store files 
in our database. We are restricted because the hashes have to be overlapped with 1-byte 
intervals in order to ensure that we captured all possible hashes. A different traffic parser 
would be necessary in order to complete this task. 

Another way the HTTP header problem could be solved would be to make decisions on groups 
of packets instead of on individual packets. This approach would look at how many 
packets for each file there should be in the network traffic, and then make a decision regarding 
whether or not the file existed in the traffic, based on a threshold of matches. 
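One way such a group decision could look is sketched below: count the matching packets attributed to each file and flag the file when the matched fraction of its expected packets reaches a threshold. The function name, the inputs, and the threshold value of 0.5 are all hypothetical, since choosing the threshold is part of the proposed future work.

```python
from collections import defaultdict


def files_detected(
    matched_file_ids: list[str],
    expected_packets: dict[str, int],
    threshold: float = 0.5,
) -> set[str]:
    """Flag files whose matched packets reach `threshold` of the expected count.

    matched_file_ids holds one file ID per packet whose hash matched.
    expected_packets maps file ID -> packets the whole file would occupy.
    """
    matched = defaultdict(int)
    for file_id in matched_file_ids:
        matched[file_id] += 1
    return {
        file_id
        for file_id, total in expected_packets.items()
        if total > 0 and matched[file_id] / total >= threshold
    }


# Hypothetical traffic: 9 of 10 packets of file A matched (header packet
# missed), while file B produced only 2 stray matches out of 40 packets.
hits = ["A"] * 9 + ["B"] * 2
print(files_detected(hits, {"A": 10, "B": 40}))
```

Under this scheme a missed header packet or final packet no longer decides the outcome for the file, and a few stray matches against a similar document are less likely to raise an alert.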

This program solely focused on HTTP traffic, but it could be extended to other protocols. 
For example, future versions could look at e-mail, FTP, SQL, and SMB. SQL and SMB can 
be easily encrypted, making it more difficult to determine if files are in a set of traffic, unless 
the traffic is decrypted. FTP is not encrypted by default, meaning that the data transferred 
over that protocol can be observed and checked against a blacklist. However, if FTPS, or 
secure FTP is used, then that data will be encrypted and difficult to observe. Our method 
should be tested on these protocols in the future to determine its usefulness in identifying 
pieces of data in network traffic. 